Introduction

Malaria  is  caused  by  apicomplexan  parasites  of  the  genus  Plasmodium.  It  is  a  major 
public  health  problem  in  many  tropical  areas  of  the  world,  and  also  affects  many  individuals  and 
military  forces  that  visit  these  areas.  In  1994  the  World  Health  Organization  estimated  that  there 
were  300-500  million  cases  and  up  to  2.7  million  deaths  caused  by  malaria  each  year,  and 
because  of  increased  parasite  resistance  to  chloroquine  and  other  antimalarials  the  situation  is 
expected to worsen considerably 1. These dire facts have stimulated efforts to develop an
international, coordinated strategy for malaria research and control 2. Development of new drugs
and  vaccines  against  malaria  will  undoubtedly  be  an  important  factor  in  control  of  the  disease. 
However,  despite  recent  progress,  drug  and  vaccine  development  has  been  a  slow  and  difficult 
process,  hampered  by  the  complex  life  cycle  of  the  parasite,  a  limited  number  of  drug  and 
vaccine  targets,  and  our  incomplete  understanding  of  parasite  biology  and  host-parasite 
interactions. 

The  advent  of  microbial  genomics,  i.e.  the  ability  to  sequence  and  study  the  entire 
genomes  of  microbes,  should  accelerate  the  process  of  drug  and  vaccine  development  for 
microbial  pathogens.  As  pointed  out  by  Bloom,  the  complete  genome  sequence  provides  the 
“sequence  of  every  virulence  determinant,  every  protein  antigen,  and  every  drug  target”  in  an 
organism  3,  and  establishes  an  excellent  starting  point  for  this  process.  In  1995,  an  international 
consortium  including  the  National  Institutes  of  Health,  the  Wellcome  Trust,  the  Burroughs 
Wellcome  Fund,  and  the  US  Department  of  Defense  was  formed  (Malaria  Genome  Sequencing 
Project)  to  sequence  the  genome  of  the  human  malaria  parasite  Plasmodium  falciparum,  and 
later,  a  second,  yet  to  be  determined,  species  of  Plasmodium.  Another  major  goal  of  the 
consortium  was  to  foster  close  collaboration  between  members  of  the  consortium  and  other 
agencies  such  as  the  World  Health  Organization,  so  that  the  knowledge  generated  by  the  Project 
could  be  rapidly  applied  to  basic  research  and  antimalarial  drug  and  vaccine  development 
programs  worldwide. 

This  report  describes  progress  in  the  Malaria  Genome  Sequencing  Project  achieved  by 
The Institute for Genomic Research and the Malaria Program, Naval Medical Research Center,
under Cooperative Research Agreement DAMD17-98-2-8005, over the 12-month period from
Dec. ’97 to Dec. ’98. The specific aims of the work covered under this cooperative agreement
were  to: 

1. Determine the sequence of 3.5 megabases of the P. falciparum genome (clone 3D7):

a)  Construct  small-insert  shotgun  libraries  (1-2  kb  inserts)  of  chromosomal  DNA  isolated 
from  preparative  pulsed-field  gels. 

b)  Sequence  a  sufficiently  large  number  of  randomly  selected  clones  from  a  shotgun 
library to provide 10-fold coverage of the selected chromosome.

c) Construct P1 artificial chromosome (PAC) libraries (inserts up to 20 kb) of
chromosomal DNA isolated from preparative pulsed-field gels.

d) If necessary, generate additional STS markers for the chromosome by i) mapping
unique-sequence contigs derived from assembly of the random sequences to the chromosome, ii)
mapping end-sequences from chromosome-specific PAC clones to YACs.

e)  Use  TIGR  Assembler  to  assemble  random  sequence  fragments,  and  order  contigs  by 
comparison  to  the  STS  markers  on  each  chromosome. 

f)  Close  any  remaining  gaps  in  the  chromosome  sequence  by  PCR  and  primer-walking 
using  P.  falciparum  genomic  DNA  or  the  YAC,  BAC,  or  PAC  clones  from  each  chromosome  as 
templates. 

2.  Analyze  and  annotate  the  genome  sequence: 

a) Employ a variety of computer techniques to predict gene structures and relate them to
known  proteins  by  similarity  searches  against  databases;  identify  untranslated  features  such  as 
tRNA  genes,  rRNA  genes,  insertion  sequences  and  repetitive  elements;  determine  potential 
regulatory  sequences  and  ribosome  binding  sites;  use  these  data  to  identify  metabolic  pathways 
in  P.  falciparum. 

3.  Establish  a  publicly-accessible  P.  falciparum  genome  database  and  submit 
sequences  to  GenBank. 

We  are  pleased  to  report  that  despite  encountering  formidable  technical  challenges, 
excellent  progress  has  been  made  towards  achievement  of  these  goals.  A  major  milestone  in 
malaria research was achieved by the TIGR/NMRC group with the publication in Science of the
first  complete  sequence  of  a  malarial  chromosome  (chromosome  2).  A  P.  falciparum  genome 
web  site  was  also  established  at  TIGR  which  contains  all  of  the  published  sequence  data  and 
annotation,  as  well  as  preliminary  data  for  other  chromosomes  currently  being  sequenced 
(http://www.tigr.org/tdb/mdb/pfdb/pfdb.html).  In  the  course  of  sequencing  chromosome  2,  the 
TIGR/NMRC  team  collaborated  with  other  groups  in  the  development  of  optical  restriction 
mapping  technology  for  rapid  mapping  of  whole  Plasmodium  chromosomes,  construction  of  a 
chromosome 2 YAC map, development of a Plasmodium gene finding program, GlimmerM,
and  construction  of  P.  falciparum  PAC  libraries.  In  addition,  we  initiated  sequencing  of  3  other 
P.  falciparum  chromosomes,  and  provided  chromosomal  DNA  for  sequencing  to  other 
laboratories  in  the  consortium.  Finally,  in  accordance  with  a  modification  of  the  Cooperative 
Agreement,  we  initiated  studies  using  microarray  technology  to  gather  information  relating  to  the 
function  of  novel  Plasmodium  genes  identified  by  the  sequencing  effort. 

Results 

Complete  nucleotide  sequence  of  P.  falciparum  chromosome  2  (Specific  Aims  1,2,3) 

Although  sequencing  of  the  AT-rich  Plasmodium  genome  presented  unique  challenges, 
determination  of  the  complete  genome  sequences  of  several  microbes  suggested  that  technology 
had  matured  to  the  point  that  sequencing  of  the  P.  falciparum  genome  was  feasible.  Several 
groups  began  working  towards  this  goal,  and  an  international  consortium  was  formed  to  fund  and 
coordinate the project 4,5. As part of this effort, we sequenced chromosome 2 of P. falciparum
clone  3D7  using  a  chromosome-specific  shotgun  sequencing  strategy.  This  approach  was 
selected  to  avoid  the  computational  and  gap  closure  problems  that  would  arise  from  application 
of  a  whole-genome  random  shotgun  strategy  to  an  AT-rich  30  Mb  genome  with  current 
technology.  In  addition,  large  insert  genomic  libraries  suitable  for  a  directed  sequencing  strategy 
were  not  available  because  large  fragments  of  P.  falciparum  DNA  are  prone  to  deletion  and 
rearrangement  in  E.  coli.  Small  insert  shotgun  libraries  were  prepared  to  minimize  the 
probability of rearrangements, and the techniques used were designed to avoid UV damage and
melting  of  AT-rich  DNA. 

P.  falciparum  clone  3D7  was  chosen  for  sequencing  because  it  can  complete  all  stages  of 
the  life  cycle,  was  used  in  a  genetic  cross  6,  and  had  been  used  in  the  Wellcome  Trust  Malaria 
Genome Mapping Project 7. Parasites were grown in vitro 8, and parasites released from host
cells by acetic acid lysis were embedded in agarose 9. Chromosomes were resolved on
preparative pulsed-field gels (1.2% SeaPlaque GTG agarose, BioRad DRIII apparatus, 180-250
sec switch time, 120° field angle, 3.7 V/cm for 90 hours at 14 °C), and the chromosome 2 bands
from 5 gels were excised, adjusted to 0.3 M sodium acetate to prevent melting of the AT-rich
DNA, and digested with agarase. Exposure to UV light was minimized to prevent DNA damage.
The DNA was sheared by nebulization and a shotgun library was prepared in pUC18 as
described 10, except that treatment with E. coli DNA polymerase I was performed (0.5 mM
dNTPs, 16 °C for 10 minutes) after the second ligation step to close nicks prior to
electroporation. The ligation mixtures were stored at -20 °C, and aliquots were electroporated
into DH10B cells and spread on ampicillin diffusion plates. The shotgun library contained 1 x
10^5 recombinants and had an average insert size of 1.6 kb.

Initial  sequencing  was  done  with  dye-primer  chemistry  used  previously  to  sequence  H. 
influenzae  and  the  other  microbial  genomes.  However,  when  sequencing  P.  falciparum  clones 
we observed an apparent artifact with the dye-primer chemistry that caused runs of G
base calls to be made incorrectly following long runs of AT-rich sequence. The
artifact  did  not  occur  when  FS+  dye-terminator  chemistry  was  used  on  the  same  template  DNAs. 
In  addition,  the  dye-terminator  chemistry  produced  significantly  longer  sequence  reads  than  the 
dye-primer  chemistry.  The  rest  of  the  random-phase  sequencing  was  subsequently  performed 
using  the  dye-terminator  chemistry.  Because  the  gel-purified  chromosome  2  DNA  was  only 
~85% pure due to co-migration of chromosome 2 DNA with sheared DNA from other
chromosomes, and to provide excess coverage to compensate for the expected non-randomness
of the shotgun library, 23,768 sequences (equivalent to ~10X coverage) were obtained during the
random  sequencing  phase. 
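
As a rough check on these figures, the relationship between read count, purity, and coverage can be sketched as below. This is a back-of-the-envelope sketch only: the 947 kb chromosome length is the value reported under General features, the average read length is not stated in the text and is back-calculated, and the function names and the assumption that only the chromosome 2 fraction of reads contributes to coverage are illustrative.

```python
def coverage(n_reads, read_length_bp, target_bp, purity):
    """Fold coverage of the target, counting only the on-target fraction of reads."""
    return n_reads * purity * read_length_bp / target_bp


def implied_read_length(n_reads, target_bp, fold_coverage, purity):
    """Average usable read length implied by a stated fold coverage."""
    return fold_coverage * target_bp / (n_reads * purity)


N_READS = 23_768     # sequences obtained in the random phase (from the text)
CHR2_BP = 947_000    # chromosome 2 length (from General features)
PURITY = 0.85        # approximate purity of the gel-purified DNA (from the text)

length = implied_read_length(N_READS, CHR2_BP, 10.0, PURITY)
print(f"implied average read length: {length:.0f} bp")
```

Under these assumptions the stated ~10X coverage corresponds to an average usable read length of roughly 470 bp.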

Sequences  were  assembled  using  a  version  of  TIGR  Assembler  that  had  been  extensively 
modified  to  assemble  the  AT-rich  and  repeat-rich  Plasmodium  sequences.  Two  modifications  to 
TIGR Assembler 11 reduced assembly time without sacrificing accuracy. TIGR Assembler
identifies  and  aligns  overlapping  fragments  in  two  steps.  The  initial  step  in  assembly  is  to  locate 
all  n-mer  oligonucleotides  shared  between  fragment  pairs.  The  software  views  all  fragment  pairs 
with  a  high  degree  of  n-mer  similarity  as  potentially  overlapping,  and  in  the  second  step  the 
Smith-Waterman method is used to align the fragments. In the bacterial genome projects the
value  of  n  used  was  typically  10-12  nucleotides.  However,  using  n=10  with  AT-rich  Plasmodium 
DNA  resulted  in  incorrect  identification  of  thousands  of  potential  fragment  overlaps,  so  that  the 
program  spent  an  inordinate  amount  of  time  attempting  to  align  the  spurious  matches.  Increasing 
n from 10 to 32 greatly reduced this problem. In addition, to prevent false merges during the
alignment  step,  TIGR  Assembler  was  modified  to  ignore  32-mers  that  were  over-represented  in 
the  data  set. 
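
The two modifications described above (a longer n-mer and masking of over-represented n-mers) can be illustrated with a minimal sketch of the seeding step. This is not TIGR Assembler code: the function names, the `max_kmer_fragments` threshold, and the rule that a single shared n-mer is enough to nominate a pair are all assumptions made for illustration, and the Smith-Waterman alignment step is omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_overlaps(fragments, k=32, max_kmer_fragments=50, min_shared=1):
    """Return pairs of fragment ids that share informative k-mers.

    fragments: dict mapping fragment id -> sequence.
    A k-mer found in more than max_kmer_fragments fragments is treated as
    over-represented (e.g. a low-complexity AT-rich repeat) and ignored.
    """
    per_fragment = {fid: kmers(seq, k) for fid, seq in fragments.items()}
    # Number of fragments in which each k-mer occurs.
    occurrence = Counter(km for kms in per_fragment.values() for km in kms)

    index = defaultdict(set)
    for fid, kms in per_fragment.items():
        for km in kms:
            if occurrence[km] <= max_kmer_fragments:
                index[km].add(fid)

    shared = Counter()
    for fids in index.values():
        for a, b in combinations(sorted(fids), 2):
            shared[(a, b)] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}


# Toy data: all four fragments carry the same AT repeat, but only f1 and f2
# share unique sequence. A short k is used so the example stays readable.
frags = {
    "f1": "ATATATATAT" + "GCGTACCAGT",
    "f2": "GCGTACCAGT" + "ATATATATAT",
    "f3": "ATATATATAT" + "TTGGCCTTAA",
    "f4": "CCCGGGAAAC" + "ATATATATAT",
}
masked = candidate_overlaps(frags, k=6, max_kmer_fragments=2)
unmasked = candidate_overlaps(frags, k=6, max_kmer_fragments=4)
```

In the toy data, masking the repeat-derived 6-mers leaves only the f1/f2 pair as a candidate, whereas without masking every pair of fragments is nominated for alignment.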

Six  hundred  and  ten  contigs  were  obtained  and  the  largest  contig  was  50  kb.  Neighboring 
contigs  were  identified  and  ordered  by  the  program  GROUPER,  which  searches  for  plasmid 
templates  with  forward  and  reverse  reads  in  different  contigs  (clone  links),  and  for  contigs  with 
sequence  similarity  at  their  termini  (grasta  links).  Contigs  within  a  group  are  separated  by 
sequence  gaps  (gaps  which  can  be  closed  by  primer  walking  on  the  templates  identified  as  clone 
links,  or  by  editing  of  the  termini  of  contigs  with  grasta  links),  and  contigs  on  the  ends  of  groups 
mark  the  ends  of  physical  gaps  (gaps  for  which  no  shotgun  clone  has  been  identified).  Ten 
groups of 1 14 contigs were localized on the chromosome by comparison to STS markers.

Closure of physical and sequence gaps used approaches described previously, with a few
modifications  to  compensate  for  the  AT-richness  of  the  DNA.  To  close  the  9  physical  gaps  in  the 
central  region  of  the  chromosome,  primers  were  synthesized  to  the  ends  of  all  groups  longer  than 
2.5  kb.  PCR  reactions  using  genomic  DNA  as  template  were  performed  with  primers  from 
adjacent  groups.  PCR  reactions  were  performed  with  the  Expand  Long  Template  PCR  System 
(Boehringer Mannheim 1681 842). PCR reactions contained 100 ng of genomic DNA and 15
pmol primer (BioServe Technologies) in a final volume of 50 µl. Cycling conditions were 94 °C
for  2  min,  followed  by  10  cycles  of  94  °C  for  1  min,  50  or  55  °C  for  1  min,  and  60  °C  for  2  min, 
20  cycles  of  94  °C  for  1  min,  50  or  55  °C  for  1  min,  and  60  °C  for  2  min  plus  20  sec  per  cycle, 
and 1 cycle at 60 °C for 10 min. The 60 °C extension temperature was necessary for
amplification of AT-rich P. falciparum DNA 13. All PCR reactions were done in 96-well format
on  Perkin  Elmer  GeneAmp  PCR  Systems  9600  or  9700.  PCR  products  were  purified  using  the 
QIAquick  PCR  Purification  Kit  (Qiagen  28104)  and  sequenced  using  dye-terminator 
chemistry. This process closed 3 physical gaps immediately, but PCR products from 2 gaps
contained  very  AT-rich  sequence  which  could  not  be  sequenced  completely,  and  remained  as 
sequence  gaps.  Those  physical  gaps  for  which  PCR  products  could  not  be  obtained  in  the  first 
step  were  reasoned  to  be  too  large  for  PCR,  and  to  contain  one  or  more  of  the  unlocalized 
groups.  We  therefore  performed  combinatorial  PCRs  with  one  primer  from  the  end  of  a  localized 
group  and  the  second  primer  from  the  ends  of  free  groups.  Two  gaps  were  closed  by  the 
combinatorial strategy. Finally, 1 physical gap was closed after editing and reassembly, and
another gap was closed by sequencing a “missing mate”: a plasmid was identified which had one
read  pointing  off  the  end  of  the  contig,  but  for  which  the  opposite  read  that  should  fall  in  the  gap 
had  failed  in  sequencing.  This  template  was  resequenced  and  the  new  sequence  provided 
sufficient information to close the gap.

Five methods were used to close sequence gaps. For
contigs  which  overlapped  but  had  not  been  merged  during  assembly,  editing  and  resequencing 
were  performed  to  close  the  gaps.  Also,  many  sequence  gaps  were  caused  by  artifacts  in  dye- 
primer  reactions,  particularly  in  extremely  AT-rich  areas.  These  artifacts  either  prevented 
merging  of  overlapping  contigs,  or  resulted  in  short  sequences  that  did  not  extend  to  the 
neighboring  contig.  Templates  from  short  or  low-quality  dye-primer  reactions  in  the  vicinity  of 
sequence  gaps  were  identified  and  resequenced  with  dye-terminator  chemistry;  the  high-quality 
sequence  provided  by  the  dye-terminator  reactions  was  sufficient  to  close  many  gaps.  For  those 
gaps  that  remained,  primer  walking  on  plasmid  templates  linking  adjacent  contigs  was  used. 
There  were  5  sequence  gaps  that  could  not  be  closed  by  the  above  methods  because  the  sequence 
was  too  AT-rich  for  primer  synthesis  and  walking.  To  close  these  gaps,  the  artificial  transposon 
AT-2 14 was inserted into one of the templates spanning each sequence gap, multiple subclones
of  each  template  were  sequenced,  and  the  sequences  were  assembled  to  close  the  gap. 
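
The clone-link grouping attributed to GROUPER earlier in this section (plasmid templates whose forward and reverse reads fall in different contigs) can be sketched with a small union-find routine. This is an illustration under stated assumptions, not the GROUPER implementation: the function name and data layout are hypothetical, and grasta links (sequence similarity at contig termini) are not modeled.

```python
from collections import defaultdict


def group_contigs_by_clone_links(read_to_contig, mates):
    """Group contigs that are joined by clone links.

    read_to_contig: dict mapping read id -> contig id.
    mates: iterable of (forward_read, reverse_read) pairs from one plasmid.
    Returns a sorted list of sorted contig-id lists, one list per group.
    """
    parent = {contig: contig for contig in set(read_to_contig.values())}

    def find(contig):
        # Walk to the group representative, compressing the path as we go.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for fwd, rev in mates:
        if fwd not in read_to_contig or rev not in read_to_contig:
            continue  # one read failed or was not assembled: no link
        a, b = find(read_to_contig[fwd]), find(read_to_contig[rev])
        if a != b:
            parent[b] = a  # the plasmid spans two contigs: merge their groups

    groups = defaultdict(list)
    for contig in parent:
        groups[find(contig)].append(contig)
    return sorted(sorted(members) for members in groups.values())


reads = {
    "p1.f": "c1", "p1.r": "c2",   # links c1 and c2
    "p2.f": "c2", "p2.r": "c3",   # links c2 and c3
    "p3.f": "c4", "p3.r": "c4",   # both reads in one contig: no new link
    "p4.f": "c5",                 # mate missing
}
pairs = [("p1.f", "p1.r"), ("p2.f", "p2.r"), ("p3.f", "p3.r"), ("p4.f", "p4.r")]
contig_groups = group_contigs_by_clone_links(reads, pairs)
```

Contigs within one returned group would be separated by sequence gaps, while the ends of each group would border physical gaps, in the sense used above.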

The  coverage  criteria  were  that  every  position  required  at  least  double-clone  coverage  (or 
sequence  from  a  PCR  product  amplified  from  genomic  DNA),  and  either  sequence  from  both 
strands  or  with  two  different  sequencing  chemistries.  These  criteria  ensured  accuracy  of  the 
assembly  and  base  calls.  The  sequence  was  edited  manually  using  TIGR  Editor,  and  additional 
sequencing  reactions  were  performed  to  improve  coverage  and  resolve  sequence  ambiguities. 
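
The coverage criteria stated above can be expressed as a per-position check. The sketch below is illustrative only: the `Read` record, its field names, and the function names are assumptions, and real finishing software would work from the assembly layout rather than a flat list of reads.

```python
from typing import NamedTuple


class Read(NamedTuple):
    clone: str        # plasmid or PCR product identifier
    start: int        # 0-based, inclusive
    end: int          # exclusive
    strand: str       # "+" or "-"
    chemistry: str    # e.g. "dye-primer" or "dye-terminator"
    from_pcr: bool = False  # True if sequenced from a genomic PCR product


def position_ok(covering_reads):
    """Apply the two coverage rules to the reads covering one position."""
    clones = {r.clone for r in covering_reads}
    strands = {r.strand for r in covering_reads}
    chemistries = {r.chemistry for r in covering_reads}
    # Rule 1: double-clone coverage, or sequence from a genomic PCR product.
    clone_rule = len(clones) >= 2 or any(r.from_pcr for r in covering_reads)
    # Rule 2: both strands, or two different sequencing chemistries.
    evidence_rule = len(strands) >= 2 or len(chemistries) >= 2
    return clone_rule and evidence_rule


def failing_positions(reads, length):
    """Return positions in [0, length) that do not meet the criteria."""
    bad = []
    for pos in range(length):
        covering = [r for r in reads if r.start <= pos < r.end]
        if not position_ok(covering):
            bad.append(pos)
    return bad


example_reads = [
    Read("a", 0, 10, "+", "dye-terminator"),
    Read("b", 5, 15, "-", "dye-terminator"),
    Read("pcr1", 12, 20, "+", "dye-terminator", from_pcr=True),
    Read("pcr1", 12, 20, "-", "dye-terminator", from_pcr=True),
]
needs_work = failing_positions(example_reads, 20)
```

Positions returned by `failing_positions` are the ones that would need additional sequencing reactions before the sequence could be considered finished.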

Verification  of  the  assembly. 

The  widely  reported  instability  of  P.  falciparum  DNA  in  E.  coli  prompted  concern  as  to 
the  ability  to  clone,  sequence,  and  accurately  assemble  P.  falciparum  genomic  DNA.  The 
coverage  criteria  used,  particularly  the  requirement  for  double  clone  coverage  at  each  position, 
ensured  that  rearrangement  or  chimerism  of  a  single  clone  did  not  lead  to  incorrect  assembly  of 
contigs.  To  independently  confirm  the  colinearity  of  the  assembled  sequence  and  genomic  DNA, 
NheI and BamHI optical restriction maps of chromosome 2 DNA were prepared in collaboration
with Dr. David Schwartz at New York University (see Optical Mapping of P. falciparum
chromosomes) and compared with restriction maps predicted from the sequence. The relative
error of predicted and observed fragment sizes was 4.3% and 5.8% for the NheI and BamHI
maps, respectively. The correspondence between the two data sets showed that there were no
major rearrangements in the assembled sequence. Further proof of colinearity was obtained by
the comparison of the in silico data to a scaffold of YAC-end sequences from
chromosome 2-specific YACs isolated from a 3D7 YAC library (see Construction of chromosome 2 YAC map).
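
The comparison between predicted and observed restriction maps can be sketched as an in silico digest followed by a fragment-by-fragment error calculation. This is a minimal sketch under stated assumptions: the text does not define how the relative error was computed, so the mean of per-fragment relative differences is used here; the observed fragments are assumed to be already ordered and matched one-to-one with the predicted ones; and the function names are hypothetical. The recognition sites (NheI, G^CTAGC; BamHI, G^GATCC) are the standard ones for these enzymes.

```python
import re

# Recognition sequences; both enzymes cut after the first base of the site.
SITES = {"NheI": "GCTAGC", "BamHI": "GGATCC"}


def digest(seq, site, cut_offset=1):
    """Return fragment lengths, in order, from digesting a linear sequence."""
    # A lookahead is used so that overlapping site occurrences are all found.
    cuts = [m.start() + cut_offset
            for m in re.finditer(f"(?={site})", seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def mean_relative_error(predicted, observed):
    """Mean of |predicted - observed| / predicted over matched fragments."""
    if len(predicted) != len(observed):
        raise ValueError("fragment lists must be matched one-to-one")
    return sum(abs(p - o) / p for p, o in zip(predicted, observed)) / len(predicted)


toy = "AAAA" + "GCTAGC" + "TTTTTTTT" + "GCTAGC" + "CC"
nhe_fragments = digest(toy, SITES["NheI"])
bam_fragments = digest(toy, SITES["BamHI"])
error = mean_relative_error([100, 200], [105, 190])
```

A large discrepancy confined to one or two neighboring fragments, rather than a uniform sizing error, is the kind of signal that would suggest a rearrangement in the assembly.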

General  features. 

Chromosome  2  of  P.  falciparum  (clone  3D7)  is  947  kb  in  length,  and  has  an  overall  base 
composition  of  80.2%  A+T,  in  agreement  with  previous  estimates  for  the  P.  falciparum  genome. 
The  chromosome  contains  a  large  central  region  encoding  single-copy  and  several  tandemly 
repeated  genes,  subtelomeric  regions  containing  variant  antigen  genes  (var),  RIF-1  elements 
(repetitive  interspersed  family)  15  and  other  repeats,  and  typical  eukaryotic  telomeres  (Figure  1). 
The  terminal  23  kb  portions  of  the  chromosome  are  non-coding  and  exhibit  77%  identity  in 
opposite  orientations.  The  left  and  right  telomeres  consist  of  tandem  repeats  of  the  sequence 
TT(T/C)AGGG ^ totaling 1141 and 551 nt, respectively. The sub-telomeric regions do not
appear  to  be  composed  of  repeat  oligomers  until  approximately  12-20  kb  internal  to  the 
chromosome,  where  a  sequence  primarily  composed  of  a  previously  reported  21  bp 
tandem  repeat  (rep20)  with  the  consensus  ACTAANNTAGGTCTTANNNT  was  found.  This 
tandem  repeat  is  found  exclusively  in  these  regions  and  occurs  134  and  96  times  in  the  left  and 
right  portions  of  the  chromosome,  respectively.  The  chromosome  2  sequence  was  inspected  for 
repeats  of  similar  abundance  and  density;  no  other  non-coding  DNA  repeats  similar  to  rep20 
were found. One region occurring in the coding portion of gene PFB0915w (RESA-H3, ^)
generated 36 copies of the peptide consensus VEE{IS}V {AVE} {PE} {STN}. A more complex
region composed of sequences that translate into 106 tandemly repeated subunits of 4 distinct
polypeptide repeats that range in size from 12 to 21 amino acids was found in gene PFB0095c
(PfEMP3 20).
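
Degenerate repeat consensuses such as those quoted above for the telomeres and rep20 can be searched for by translating the consensus into a regular expression. The sketch below is illustrative: the function names are hypothetical, only a subset of IUPAC codes is handled, the telomere repeat is written with the IUPAC code Y for T or C, and the rep20 consensus is used exactly as printed in the text.

```python
import re

# Subset of IUPAC nucleotide codes needed for the consensuses in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus.upper())


def count_matches(seq, consensus):
    """Count non-overlapping matches of the consensus in the sequence."""
    return len(re.findall(consensus_to_regex(consensus), seq.upper()))


def longest_tandem_run(seq, consensus):
    """Return the number of copies in the longest back-to-back run."""
    unit = consensus_to_regex(consensus)
    best = 0
    for m in re.finditer(f"(?:{unit})+", seq.upper()):
        best = max(best, len(m.group(0)) // len(consensus))
    return best


TELOMERE = "TTYAGGG"               # TT(T/C)AGGG
REP20 = "ACTAANNTAGGTCTTANNNT"     # consensus as printed in the text

telomere_like = "CCC" + "TTTAGGG" + "TTCAGGG" + "TTTAGGG" + "AA"
rep20_like = "ACTAAGGTAGGTCTTACCCT" * 2 + "G"
```

Applied to a chromosome sequence, `count_matches` gives a copy count comparable in spirit to the figures quoted above, although an exact-consensus search will miss diverged copies that a similarity-based search would find.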

A  region  with  centromere  functions  could  not  be  identified  based  on  sequence  similarity 
to  S.  cerevisiae  centromeres  or  other  eukaryotic  centromeres.  However,  there  were  several 
regions  of  up  to  12  kb  that  were  devoid  of  large  open  reading  frames  and  which  might  contain 
the  centromere.  Alternatively,  mitotic  and  meiotic  centromeric  functions  may  be  defined  by  a 
specific higher-order DNA structure and chromatin-associated protein complexes as in other eukaryotes 21.
In  addition  to  the  apparent  lack  of  DNA  sequences  identifiable  as  centromeric,  we  were  unable  to 
identify  sequences  that  may  be  involved  in  the  initiation  of  chromosome  replication.  Although 
little  is  known  about  chromosomal  replication  in  Plasmodium,  it  is  expected  to  involve  many 
origins of replication as in S. cerevisiae, which is the only eukaryote in which replication origins
have been clearly defined 22.

Analysis  of  predicted  coding  sequences. 

The  non-redundant  (NR)  protein  sequence  database  at  the  National  Center  for 
Biotechnology  Information  (NIH,  Bethesda)  was  searched  using  the  gapped  BLAST  program.  If 
necessary,  database  searches  were  reiterated  using  the  PSI-BLAST  program,  which  constructs 
position-dependent  weight  matrices  on  the  basis  of  alignments  generated  by  BLAST  and 
employs  them  for  subsequent  search  iterations.  Potential  coding  regions  were  predicted  using 
GlimmerM, a eukaryotic gene-finding program based on 23. GlimmerM was trained on a set of
117 P. falciparum sequences taken from GenBank. Gene models based on Glimmer predictions,
similarity  of  ORFs  to  known  proteins,  and  prediction  of  putative  signal  peptides  and 
transmembrane domains were constructed with the Annotator program (Lixin Zhou, TIGR). In
cases  where  a  putative  gene  had  no  database  match  and  multiple  Glimmer  predictions  of  gene 
structure,  the  highest  scoring  model  was  reported.  After  the  first  set  of  models  were  inspected, 
they  were  added  to  the  training  set  and  GlimmerM  was  re-trained.  These  models  should  be 
regarded  as  preliminary  until  confirmed  by  other  methods.  Protein  structural  features  were 
delineated using a hierarchical scheme implemented in the UniPred program of the SEALS
package 24. Signal peptides were predicted using the SignalP program with the parameters
optimized for eukaryotic proteins 25, and transmembrane helices were predicted using PHDhtm.
Coiled  coil  domains  were  predicted  using  a  modification  of  the  COILS  program  (John  Kuzio, 
NCBI).  Regions  of  low  complexity  predicted  to  form  non-globular  structures  were  identified 
using  the  SEG  program  with  the  following  parameters:  window  length  45,  trigger  complexity 
3.4, extension complexity 3 26. Multiple sequence alignments were constructed using the
CLUSTALW  program  or  the  Gibbs-sampling  option  of  the  MACAW  program.  Transfer  RNAs 
were identified with the tRNAscan program 27. Systematic gene names based on a scheme
similar to that devised for the S. cerevisiae genome 28 were assigned using the convention PF
(for P. falciparum), a letter for the chromosome (A for chromosome 1, B for chromosome 2, and
so forth), a 3-digit code ordering the genes from left to right in increments of 5 (to allow for
future modifications), and a letter denoting the coding strand (w or c).
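
The naming convention just described can be sketched as a small parser and generator. Several details are assumptions made for illustration: the helper names are hypothetical, chromosome letters are limited to A through N on the assumption of 14 chromosomes, and the numeric field is accepted with either 3 or 4 digits because the gene names quoted in this report carry 4 digits. The example names below are used only to exercise the code.

```python
import re

# PF + chromosome letter + numeric order code + coding-strand letter.
GENE_NAME = re.compile(r"^PF([A-N])(\d{3,4})([wc])$")


def parse_gene_name(name):
    """Split a systematic gene name into its components."""
    m = GENE_NAME.match(name)
    if m is None:
        raise ValueError(f"not a systematic gene name: {name!r}")
    letter, number, strand = m.groups()
    return {
        "chromosome": ord(letter) - ord("A") + 1,  # A -> 1, B -> 2, ...
        "order": int(number),                      # left-to-right order code
        "strand": strand,                          # "w" or "c", as in the name
    }


def make_gene_name(chromosome, position_index, strand, width=4):
    """Build a name for the gene at a given left-to-right index (1, 2, ...).

    Order codes advance in increments of 5, leaving room for later insertions.
    """
    if not 1 <= chromosome <= 14 or strand not in ("w", "c"):
        raise ValueError("chromosome must be 1-14 and strand 'w' or 'c'")
    letter = chr(ord("A") + chromosome - 1)
    return f"PF{letter}{position_index * 5:0{width}d}{strand}"


parsed = parse_gene_name("PFB0915w")
built = make_gene_name(2, 183, "w")
```

Numbering in steps of 5 means that a gene discovered later between two existing neighbors can be given an intermediate code without renaming the rest of the chromosome.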

Two hundred and nine protein-encoding genes and a gene for tRNAGlu (Fig. 1, Table 1)
were  predicted  on  chromosome  2,  giving  a  gene  density  of  one  gene  per  4.5  kb,  which  is 
significantly  lower  than  in  yeast  (one  gene  per  2  kb)  but  somewhat  higher  than  in  C.  elegans  (one 
gene  per  7  kb).  Of  the  209  protein-encoding  genes,  43%  contained  at  least  one  intron.  This 
should  be  considered  an  estimate  as  it  cannot  be  ruled  out  that  some  introns  were  missed  by  the 
current  gene  finding  method,  or  that  some  of  the  non-globular  inserts  detected  in  Plasmodium 
proteins  (see  below)  are  actually  introns.  The  majority  of  the  spliced  genes  consist  of  only  two  or 
three  exons  but  for  two  genes  8  exons  were  predicted.  Thus  in  terms  of  intron  content  and  gene 
density,  the  Plasmodium  genome,  assessed  by  the  analysis  of  the  first  completed  chromosome 
sequence,  appears  to  be  intermediate  between  the  condensed  state  of  the  yeast  genome  and  the 
intron-rich  genomes  of  multicellular  eukaryotes. 

The proteins encoded in chromosome 2 (Table 2) fall into three categories: i) proteins
with at least one distinct globular domain conserved in species other than Plasmodium; ii)
proteins belonging to Plasmodium-specific families with identifiable structural features and in
some cases, known functions; iii) completely uncharacterized proteins which were
predominantly  non-globular  and  often  large  (the  term  "non-globular"  refers  to  proteins  or 
domains  of  proteins  that  do  not  assume  compact,  folded  structures).  Homologs  outside 
Plasmodium  were  detected  for  87  out  of  the  209  predicted  proteins  (this  includes  not  only 
proteins  in  the  1st  category  but  also  some  in  the  2nd  category  that  have  unique  domain 
organization  so  far  found  only  in  Plasmodium  but  that  contain  a  conserved  domain).  Most  of  the 
remaining  proteins  were  predicted  to  consist  primarily  of  non-globular  domains  (Table  1).  Thus 
the  low  fraction  of  proteins  containing  conserved  domains  in  Plasmodium  chromosome  2  is  due 
to  the  enrichment  of  the  Plasmodium  genome  with  genes  encoding  predominantly  non-globular 
proteins.  The  abundance  of  non-globular  domains  or  proteins  in  Plasmodium  was  very  unusual; 
the  proportion  of  non-globular  proteins  in  other  eukaryotes  such  as  S.  cerevisiae  (Table  1), 
Caenorhabditis  elegans,  and  humans  is  approximately  half  that  observed  in  Plasmodium. 
Remarkably,  13  predicted  proteins  on  chromosome  2  contained  large  regions  (greater  than  30 
amino  acids)  with  predicted  non-globular  structure  inserted  directly  into  globular  domains,  a 
phenomenon  so  far  unique  to  Plasmodium.  The  non-globular  insertions  did  not  exhibit  the  AT- 
bias  typical  of  introns  and  were  not  flanked  by  consensus  splice  sites. 

To  determine  whether  the  non-globular  domains  were  encoded  in  mRNA,  RT-PCR  was 
performed with primers flanking the non-globular domains in 11 genes from chromosome 2,
using  total  blood  stage  cDNA  as  template.  RT-PCR  was  also  performed  to  assess  expression  of  2 
genes  encoding  predominantly  non-globular  proteins.  In  all  cases  examined,  the  RT-PCR 
product  was  the  same  size  as  the  corresponding  PCR  products  from  genomic  DNA,  and  the 
sequence  of  the  RT-PCR  product  matched  the  genomic  DNA  sequence  (Figure  2A).  Thus  it 
appears  likely  that  most  if  not  all  predicted  non-globular  domains  in  chromosome  2  gene 
products  are  expressed.  One  interesting  example  of  insertion  of  a  non-globular  domain  into  a 
well-defined  globular  domain  is  seen  in  a  5'-3'  exonuclease  (Figure  2B).  Alignment  of  the 
Plasmodium  sequence  with  4  bacterial  exonucleases  revealed  insertion  of  a  176  amino  acid 
sequence which was situated in a region corresponding to a sharp turn between a strand
and  helix  in  the  3-dimensional  structure.  These  observations  seem  to  reveal  a  striking  flexibility 
of  eukaryotic  proteins  in  terms  of  accommodating  inserts  that  do  not  impair  protein  function  and 
accordingly,  may  be  excluded  from  the  protein  core  folding.  Structural  analysis  of  Plasmodium 
proteins  containing  non-globular  inserts  may  be  valuable  for  understanding  the  general  principles 
of  protein  folding.  The  propagation  of  non-globular  domains  in  Plasmodium  suggests  that  such 
proteins  provide  specific  selective  advantages  to  the  parasite.  The  preponderance  of  the  predicted 
non-globular  domains  in  Plasmodium  demonstrates  the  remarkable  plasticity  and  adaptability  of 
the  eukaryotic  genome. 

Evolutionarily conserved proteins.

The  conserved  proteins  encoded  in  chromosome  2  are  a  structurally,  functionally  and 
phylogenetically  diverse  set  (Table  2).  The  majority  of  the  conserved  proteins  show  the  greatest 
similarity  to  eukaryotic  homologs  or  belong  to  specifically  eukaryotic  protein  families.  However, 
15  proteins  were  significantly  more  similar  to  bacterial  than  to  eukaryotic  homologs,  and 
furthermore,  in  4  cases,  analysis  of  the  chromosome  2  sequence  revealed  the  first  eukaryotic 
representative  of  a  protein  family  that  is  conserved  in  bacteria.  These  proteins  may  have  been 
transferred  to  the  nuclear  genome  from  an  organellar,  probably  plastid  genome,  after  the 
divergence of the apicomplexa from the other eukaryotic lineages. Several of these proteins
contained  likely  N-terminal  organellar  import  peptides  and  are  predicted  to  function  within  the 
apicoplast  or  the  mitochondrion.  Of  particular  interest  in  this  regard  were  3  genes  encoding 
proteins  involved  in  fatty  acid  metabolism.  3-ketoacyl-ACP  synthase  III  (FabH)  catalyzes  the 
condensation  of  acetyl-CoA  and  malonyl-ACP  in  Type  II  (dissociated)  fatty  acid  synthase 
systems. Type II synthase systems are restricted to bacteria and the plastids of plants, supporting
previous hypotheses that the Plasmodium apicoplast contains metabolic pathways distinct from
those of the host 29,30. Two genes, apparently products of a tandem duplication event, encode
another  enzyme  of  fatty  acid  metabolism,  acyl-CoA  synthetase;  another  member  of  this  protein 
family  encoded  on  a  different  Plasmodium  chromosome  has  been  described  as  an  "octapeptide 
repetitive  antigen",  on  the  basis  of  unique  sequence  features  of  its  predicted  non-globular  domain 
31.  The  expansion  of  the  fatty  acid  biosynthesis  genes  in  Plasmodium  may  suggest  an  as  yet 
uncharacterized  metabolic  process  occurring  in  the  parasite. 

Given  that  the  Apicomplexa  represent  a  deep  branching  in  the  eukaryotic  tree,  the 
presence  of  genes  coding  for  distinct  eukaryotic  proteins  with  conserved  domain  organization  is 
important evidence that these proteins originated early in the evolution of eukaryotes. The majority of these
genes  code  for  proteins  that  participate  in  replication,  repair,  transcription,  or  translation.  These 
specifically  eukaryotic  genes  that  are  highly  conserved  in  Plasmodium  include  the  origin 
recognition  complex  subunit  5;  the  excision  repair  proteins  ERCC1  and  RAD2;  several  proteins 
involved  in  chromatin  dynamics,  such  as  orthologs  of  the  superfamily  II  helicase  BRAHMA, 
DRING  protein  containing  the  RING  finger  domain,  and  SNW1;  RNA-binding  proteins,  such  as 
the  ortholog  of  Drosophila  DRIBBLE  protein  containing  the  KH  domain,  and  a  small  nuclear 
RNP  protein  containing  the  RRM  domain;  and  2  paralogous  proteins  containing  the  DHHC 
finger  domain.  Furthermore,  several  typical  eukaryotic  proteins  involved  in  secretion  are 
encoded in chromosome 2, such as the SEC61 gamma subunit, the coated pit coatomer subunit, and
syntaxin,  suggesting  early  emergence  of  the  eukaryotic  secretory  system. 

A  remarkable  feature  detected  in  chromosome  2  is  the  expansion  of  genes  coding  for 
DnaJ-like  domains.  Proteins  of  the  DnaJ  superfamily  act  as  cofactors  for  the  HSP70-type 
molecular  chaperones  and  participate  in  cellular  processes,  such  as  protein  folding  and 
trafficking,  complex  assembly,  organelle  biogenesis,  and  initiation  of  translation  32.  Five 
proteins  containing  DnaJ  domains  are  present  on  chromosome  2,  which  suggests  prominent  roles 
of  this  domain  in  the  Plasmodium  life  cycle.  Two  of  these  consist  primarily  of  the  DnaJ  domain, 
whereas  the  remaining  three  (two  tandem  genes  in  the  right  subtelomeric  region  and  one  in  the 
left  subtelomeric  region)  also  have  a  large  non-globular  domain.  Several  proteins  containing  the 
DnaJ  domain  and  similar  to  the  three  large  DnaJ  domain-containing  proteins  encoded  in 
chromosome 2 have been detected on other chromosomes, indicating that this is a large gene
family in Plasmodium. One of its members has been described as the ring-infected erythrocyte
surface antigen (RESA); in fact, this protein has been shown to bind to the cytoplasmic side of
the erythrocyte membrane, which suggests that the DnaJ domains perform chaperone-like
functions in the formation of specific protein complexes at this location 33-37. Interestingly, the
DnaJ domains in some of the P. falciparum proteins contain substitutions in the critical
His-Pro-Asp signature required for interaction with HSP70-type proteins, which may be related to a
modification  of  the  typical  chaperone  function.  The  actual  functions  of  the  chaperone-like 
domains  in  Plasmodium  surface  antigens  remain  to  be  determined. 

Chromosome  2  encodes  90  predicted  membrane  proteins,  and  some  of  these  are  obvious 
members  of  distinct  transporter  families,  such  as  amino  acid  and  sugar  transporters,  indicating 
that Plasmodium has functional transport systems for a number of metabolites, but the lack of
even  a  single  ABC  transporter  is  unexpected  as  these  are  almost  invariably  detected  in  yeast  or 
prokaryotic  genomic  stretches  of  similar  size. 

Another  interesting  feature  of  chromosome  2  is  the  presence  of  genes  for  5  predicted 
protein  kinases  belonging  to  distinct  families  (e.g.  MAP  kinases),  and  a  GAF-domain-containing 
protein  that  may  be  involved  in  cNMP-dependent  signaling.  A  rough  extrapolation  of  the  number 
of  protein  kinases  found  on  chromosome  2  to  the  complete  Plasmodium  genome  suggests  a  total 
of about 150, indicating that the expansion of protein kinases and, accordingly,
their role in signal transduction and regulation are at least as prominent in Plasmodium as in
yeast. This prominence of regulators is in striking contrast to the situation in bacterial pathogens,
which  appear  to  have  shed  most  of  the  regulatory  systems,  and  is  likely  to  be  related  to  the 
complex  life  cycle  of  the  parasite.  Phosphorylation  and  dephosphorylation  indeed  play  an 
important  role  in  the  development  and  sexual  differentiation  of  malaria  38. 
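
The extrapolation of the kinase count is a simple proportional scaling. A minimal sketch follows, assuming a chromosome 2 size of roughly 0.95 Mb and a genome size of roughly 30 Mb; both figures are assumptions introduced here for illustration and are not stated in this section, and the function name is hypothetical.

```python
def extrapolate_count(count_on_chromosome, chromosome_mb, genome_mb):
    """Scale a per-chromosome gene count to the whole genome,
    assuming uniform gene density (a rough approximation)."""
    return count_on_chromosome * genome_mb / chromosome_mb


# Assumed sizes (not taken from this section): chromosome 2 ~0.95 Mb, genome ~30 Mb.
kinases = extrapolate_count(5, 0.95, 30.0)
print(round(kinases))  # -> 158, i.e. on the order of the ~150 quoted above
```

The estimate is sensitive to the assumed genome size and to any non-uniform distribution of kinase genes among chromosomes, so it should be read as an order-of-magnitude figure only.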

Unique  Plasmodial  protein  families. 

Chromosome  2  encompasses  5  families  of  proteins  that  are  unique  to  Plasmodium  in 
terms  of  their  distinct  domain  organization,  though  three  of  them  contain  domains  conserved  in 
other species. The genes comprising these Plasmodium-specific families mostly concentrate in
the  subtelomeric  regions  of  the  chromosome.  The  most  abundant  family  includes  proteins  that 
were  dubbed  rifins,  after  the  RIF-1  repetitive  element.  RIF-1  elements  are  found  on  most 
chromosomes,  are  highly  transcribed  in  blood  stage  parasites,  and  contain  a  1  kb  open  reading 
frame  but  no  initiation  codon  15.  Eighteen  RIF-1  elements  were  found  in  the  subtelomeric 
regions  of  chromosome  2;  two  of  these  appeared  to  be  pseudogenes.  Inspection  of  the  sequence 
upstream of each RIF revealed potential exons which encoded predicted signal peptides, and
RT-PCR with schizont RNA showed that 1 of 6 rifin genes tested was transcribed (data not
shown). The rifin genes encode polypeptides of 27-35 kD with an extracellular domain
containing  conserved  Cys  residues  which  might  participate  in  disulfide  bonding,  a 
transmembrane  segment,  and  a  short  basic  C-terminus  (Figure  3).  Based  on  the  sequence 
conservation  in  the  extracellular  domain,  the  rifins  identified  on  chromosome  2  can  be  grouped 
into  two  subfamilies,  with  a  high  level  of  conservation  within  each  subfamily  but  only  a  limited 
similarity  between  the  subfamilies.  Clusters  of  rifin  genes  similar  to  those  on  chromosome  2 
were  detectable  also  in  the  telomeric  regions  of  chromosomes  3  and  14  (unpublished 
observations).  A  phylogenetic  tree  analysis  of  the  rifin  sequences  indicates  a  direct  relationship 
between  some  of  the  rifin  genes  in  chromosome  2  and  other  chromosomes,  suggesting  that  in  the 
course  of  Plasmodium  evolution,  the  rifin  genes  have  propagated  as  a  cluster  (data  not  shown).  If 
the  number  of  rifins  found  on  chromosome  2  is  representative  of  the  other  chromosomes,  there 
may be over 500 rifins in the P. falciparum genome (~7% of all protein-coding genes), making
it the most abundant gene family. The location and distribution of the rifin genes resemble those
of the var genes, which may represent 2-3% of plasmodial genes and encode large proteins
(PfEMP1) located on the surface of infected red cells that are involved in antigenic variation,
cytoadherence, and rosetting 39-42. Most var genes are located in subtelomeric regions, and var
gene  diversity  is  thought  to  be  generated  by  recombination  between  alleles,  a  process  which 
might  be  facilitated  by  the  subtelomeric  repeats  43.  The  abundance  and  co-localization  of  rifins 
with  var  genes  in  subtelomeric  regions  suggests  that  rifins  may  be  involved  in  similar  processes, 
which  is  compatible  with  the  presence  of  a  highly  variable  region  in  the  predicted  extracellular 
domain  of  the  rifins  (Fig.  3).  Work  is  in  progress  to  determine  where  within  the  parasite  and  host 
cell  the  rifins  are  expressed. 

Two var genes were identified on chromosome 2, one in each subtelomeric region and
distal  to  the  rifins.  Both  var  genes  had  the  typical  two-exon  structure,  and  encoded  two  Duffy 
binding-like  (DBL)  domains  that  were  most  similar  to  DD2  var-1  DBL  domains  1  and  4.  The 
PFB0010w var gene was unusual in that the SVL region was much shorter than in other known
var  genes.  In  addition  to  the  two  full-length  var  genes,  6  other  small  ORFs  were  identified  in  the 
subtelomeric  regions  that  had  similarity  to  var  sequences.  Four  of  these  appear  to  be 
pseudogenes, but two others resemble the var exon II cDNAs reported previously 40.

Another  family  of  membrane-associated  proteins,  called  SERAs  (SErine  Repeat 
Antigens),  is  of  interest  in  that  they  contain  a  cathepsin  protease-like  domain.  A  cluster  of  three 
SERA  genes  each  composed  of  4  exons  and  all  transcribed  in  the  same  direction  (from 
centromere to telomere) was known to be located on chromosome 2 44, and at least one has been
extensively  evaluated  for  use  in  blood  stage  vaccines.  The  chromosome  2  sequence  demonstrates 
that  the  three  known  SERA  genes  on  chromosome  2  are  part  of  an  8-gene  cluster.  All  but  one  of 
these 8 genes seem to have a similar 4-exon structure; the last gene at the 3' end of the cluster
contains  only  3  exons  according  to  structure  predictions.  Alignment  of  the  8  SERA  proteins 
revealed  conservation  of  the  central  and  C-terminal  regions,  the  N-termini  being  more  divergent. 
The  first  SERA  gene  described  is  the  only  one  that  contains  a  stretch  of  repeated  serines,  making 
the  generic  name  SERA  quite  inappropriate.  The  protease  domain  in  SERA  is  remarkable  in  that 
5  out  of  the  8  copies  contain  serine  instead  of  cysteine  in  the  active  nucleophile  position, 
suggesting  that  they  are  serine  proteases  with  a  structure  typical  of  cysteine  proteases  45.  The 
expansion  of  this  protease  gene  family  suggests  an  important  function,  possibly  in  merozoite 
release  from  schizonts  or  processing  of  merozoite  surface  proteins. 

Two copies of another family of surface antigens, typified by MSP-4 46, were found on
the  chromosome.  An  interesting  feature  of  these  proteins  is  the  presence  of  an  epidermal  growth 
factor  (EGF)  module  in  their  extracellular  domains.  Together  with  MSP-1,  a  multi-EGF  domain 
protein encoded on chromosome 3, and two Plasmodium sexual stage antigens 47-49, these are
the  only  proteins  outside  the  animal  kingdom  that  contain  EGF  repeats,  and  it  appears  likely  that 
the  sequence  coding  for  this  domain  was  hijacked  by  Plasmodium  from  its  animal  host.  The 
plasmodial  EGF  domain  may  be  involved  in  parasite  adhesion  to  animal  cells  through 
homophilic  interactions. 

In addition to the families of Plasmodium-specific proteins, chromosome 2 contains
genes  for  many  secreted  and  membrane  proteins  that  have  no  identifiable  homologs  in  the 
databases.  Interestingly,  one  of  these  genes  encodes  a  protein  with  a  modified  thrombospondin 
domain,  and  was  found  to  be  transcribed  in  blood  stage  parasites  (data  not  shown).  Other 
Plasmodium  proteins  that  contain  thrombospondin  domains,  such  as  sporozoite  surface  protein 
2/TRAP and circumsporozoite protein, are involved in parasite invasion of host cells 50-52, and it
is  tempting  to  speculate  that  this  protein  is  involved  in  binding  of  infected  red  cells  to  host  cell 
ligands. 

Optical  mapping  of  P.  falciparum  chromosomes  (added  to  Specific  Aim  1) 

Instability  of  AT-rich  P.  falciparum  DNA  in  E.  coli  was  one  of  several  major  technical 
difficulties  faced  at  the  beginning  of  the  sequencing  project.  Frequent  rearrangements  or 
deletions  of  the  cloned  P.  falciparum  DNA  might  have  prevented  the  sequencing  of  large  parts  of 
the genome, or produced an inaccurate assembly of the sequence. Even if these problems had not
been encountered during the sequencing and assembly process, validation of the final sequence
was  required  to  ensure  that  the  completed  sequence  was  an  accurate  representation  of  the 
genome. 

We  initially  planned  to  use  a  set  of  sequence  markers  from  the  ends  of  chromosome  2- 
specific  Yeast  Artificial  Chromosomes  for  validation  of  the  sequence  (see  below).  It  had  been 
reported  that  YACs  containing  P.  falciparum  DNA  were  quite  stable,  but  workers  in  other  fields 
had noted instability in some YACs. If a proportion of the P. falciparum YACs were unstable,
they  could  not  be  used  for  validation. 

Dr. David Schwartz at New York University has developed a new technology, called
optical restriction mapping, that can generate restriction maps of large DNA molecules very
rapidly. Optical mapping does not require cloning of the DNA to be mapped, so
rearrangements that might occur during cloning are avoided. This technology provides a method
for  sequence  validation  that  is  completely  independent  of  the  techniques  used  for  generation  of 
the  sequence.  A  collaboration  was  established  with  Dr.  Schwartz’s  laboratory  to  produce  an 
optical  map  of  chromosome  2.  Briefly,  purified  chromosome  2  molecules  prepared  by  Dr.  Dan 
Carucci at NMRC were fixed to glass substrates, digested in situ with NheI or BamHI, and stained
with YOYO-1; the order and size of the DNA fragments were then obtained by direct observation
of single DNA molecules by fluorescence microscopy. Fragment size was determined by
automated fluorescence intensity measurements 53. The optical restriction maps were compared
with restriction maps predicted from the completed chromosome 2 sequence. The relative error
between predicted and observed fragment sizes was 4.3% and 5.8% for the NheI and BamHI maps,
respectively.  The  correspondence  between  the  two  data  sets  showed  that  there  were  no  major 
rearrangements  in  the  assembled  sequence,  and  confirmed  that  the  chromosome  2  sequence  had 
been properly assembled. A manuscript describing these findings has been submitted for
publication (see Appendix).
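
The comparison between the sequence-predicted and optical maps can be illustrated with a short in silico digest. This is only a sketch under stated assumptions: the function names are hypothetical, the digest is assumed to be complete on a linear molecule, and "relative error" is taken here to mean the mean of |observed - predicted| / predicted over paired fragments, which may differ from the exact statistic used in the manuscript.

```python
import re

# Recognition sites; both enzymes cut after the first base (G^CTAGC, G^GATCC).
SITES = {"NheI": "GCTAGC", "BamHI": "GGATCC"}
CUT_OFFSET = 1


def predicted_fragments(sequence, enzyme):
    """Return fragment lengths from a complete in silico digest of a linear molecule."""
    site = SITES[enzyme]
    # A lookahead pattern also finds overlapping occurrences of the site.
    cuts = [m.start() + CUT_OFFSET
            for m in re.finditer(f"(?={site})", sequence.upper())]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def mean_relative_error(predicted, observed):
    """Average of |observed - predicted| / predicted over paired fragments."""
    if len(predicted) != len(observed):
        raise ValueError("maps must contain the same number of fragments")
    return sum(abs(o - p) / p for p, o in zip(predicted, observed)) / len(predicted)


toy = "AAGCTAGCTTTGGATCCAA"              # toy sequence, not real data
print(predicted_fragments(toy, "NheI"))   # -> [3, 16]
print(predicted_fragments(toy, "BamHI"))  # -> [12, 7]
```

In practice the optical map and the predicted map must first be aligned, because small fragments can be missed and sites can be partially digested; the sketch assumes that pairing has already been done.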

The  successful  application  of  the  optical  mapping  approach  to  sequence  validation  of 
chromosome  2  led  the  Malaria  Genome  Sequencing  Consortium  to  recommend  that  Dr. 
Schwartz’s  laboratory  be  funded  to  generate  restriction  maps  of  the  entire  P.  falciparum  genome. 
A  map  of  the  complete  genome  was  recently  determined,  and  all  of  the  sequencing  centers  are 
using  the  maps  for  sequence  validation.  In  addition,  TIGR  and  NMRC  are  collaborating  with  Dr. 
Schwartz  in  developing  methods  to  use  the  optical  restriction  maps  for  gap  closure  as  well  as  for 
validation. 

Construction of a chromosome 2 YAC map (Specific Aim 1d)

As  mentioned  previously,  our  original  plan  for  verification  of  the  sequence  involved 
comparison  of  the  chromosome  2  sequence  to  a  scaffold  of  YAC  end  sequences  from  a  3D7 
YAC  library.  YAC  clones  were  isolated  by  PCR  screening  of  the  YAC  library  7  using  primers 
derived from known chromosome 2 STSs 12. YACs were sized by pulsed-field gel
electrophoresis. YAC terminal sequences were obtained by digestion of YAC DNA with AluI,
DraI, MseI, RsaI, SspI, or SwaI. DNA fragments were ligated to adaptors and subjected to PCR
amplification  with  a  combination  of  vector-  and  adaptor-specific  primers.  DNA  sequencing  was 
performed  for  both  strands  using  primers  for  the  adaptor  and  for  the  yeast  vector.  A  YAC  map 
was  constructed  by  aligning  the  terminal  sequences  to  the  chromosome  2  assembly.  A 
manuscript  describing  these  results  is  in  preparation,  and  further  details  will  be  provided  in  a 
subsequent  report. 

Development of a P. falciparum gene finding program (added to Specific Aim 2)

As  the  chromosome  2  sequence  was  nearing  completion,  it  became  apparent  that  a  new 
gene  finding  program  would  be  needed  to  aid  in  annotation  of  the  sequence.  Several  gene  finding 
programs  designed  to  predict  genes  in  organisms  such  as  human  and  the  plant  Arabidopsis  had 
been  tested  on  the  P.  falciparum  chromosome  2  sequence,  and  had  not  been  able  to  accurately 
detect  known  chromosome  2  genes.  Dr.  Stephen  Salzberg  at  TIGR  had  been  involved  in  the 
development  of  the  prokaryotic  gene  finder  Glimmer,  which  is  used  at  TIGR  for  gene  prediction 
for  all  of  TIGR’s  bacterial  genome  projects.  Dr.  Salzberg  offered  to  modify  Glimmer  for 
prediction of genes in P. falciparum, and with other members of his laboratory produced a
program  called  GlimmerM  that  was  used  for  annotation  of  chromosome  2. 

Details  of  the  GlimmerM  software  are  beyond  the  scope  of  this  report,  but  a  detailed 
description  of  the  programs  is  provided  in  a  paper  submitted  for  publication  23  (Appendix). 
Briefly,  the  GlimmerM  program,  when  provided  with  a  training  set  of  well-characterized  P. 
falciparum  genes,  constructs  a  statistical  model  of  coding  sequences  and  donor  and  acceptor 
splice  sites.  New,  uncharacterized  P.  falciparum  sequence  is  then  analyzed  by  GlimmerM,  and  a 
set  of  putative  gene  models  is  produced.  These  models  are  then  evaluated  by  expert  annotators  in 
conjunction  with  other  evidence  such  as  database  matches,  the  presence  of  signal  peptides  and 
transmembrane  domains,  etc.,  to  produce  the  final  gene  models  reported  in  the  annotation.  These 
models are based on the best available evidence but should be regarded as preliminary until
confirmed  by  other  methods.  The  accuracy  of  GlimmerM  will  improve  as  the  training  set  of 
well-characterized  P.  falciparum  sequences  is  enlarged,  and  further  refinements  to  the  algorithm 
are  made.  GlimmerM  will  be  extremely  useful  for  annotation  of  the  P.  falciparum  genome,  and 
has  already  been  distributed  to  other  members  of  the  Malaria  Genome  Sequencing  Consortium. 
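
The idea of training a statistical model on known coding sequences and then scoring new sequence can be illustrated with a toy example. The sketch below is not GlimmerM: it uses a simple fixed-order Markov chain with add-one smoothing and a log-likelihood ratio against a non-coding model, it ignores splice sites and reading frame, and all function names are hypothetical. The actual models are described in the cited paper.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(sequences, order=2):
    """Count (context -> next base) transitions, starting every count at 1."""
    counts = defaultdict(lambda: {b: 1 for b in BASES})
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in BASES:
                counts[context][base] += 1
    return counts


def log_likelihood(seq, model, order=2):
    """Sum of log transition probabilities; unseen contexts are treated as uniform."""
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        row = model[context] if context in model else {b: 1 for b in BASES}
        if base in row:
            total += math.log(row[base] / sum(row.values()))
    return total


def coding_score(seq, coding_model, noncoding_model, order=2):
    """Positive values mean the sequence resembles the coding training set more."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))


# Toy training data, invented for illustration only.
coding = train_markov(["ATGGAAGAAGAAGAAGAAGAA"] * 3)
noncoding = train_markov(["ATATATATATATATATATAT"] * 3)
print(coding_score("GAAGAAGAAGAA", coding, noncoding) > 0)  # -> True
```

As with the real program, the usefulness of such a model depends heavily on the size and quality of the training set, which is why the predictions are reviewed by annotators together with other evidence.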


Construction of PAC libraries (Specific Aim 1c)

One  of  the  major  impediments  to  research  into  the  molecular  biology  of  malaria  parasites 
is the lack of an E. coli-based cloning vector that can accommodate large (>25 kb) inserts of P.
falciparum DNA without deletion or rearrangement. In principle, a large insert library would be
useful for sequencing of the P. falciparum genome, either as a source of clones for
sequencing or as a source of sequence markers to simplify linking of contigs into groups.

Recently, a new E. coli cloning system, called Bacterial Artificial Chromosomes (BACs),
that can accommodate DNA inserts greater than 100 kb, has been developed. A derivative of
the BAC vector, called a PAC (P1 artificial chromosome) vector, has also been developed. The BAC
and  PAC  vectors  have  been  shown  to  stably  maintain  “difficult”  inserts  from  a  variety  of  species. 
We  collaborated  with  Dr.  Pieter  de  Jong  at  the  Roswell  Park  Cancer  Institute,  to  determine 
whether  P.  falciparum  DNA  would  also  be  stable  in  PAC  vectors. 

P.  falciparum  DNA  was  provided  to  Dr.  de  Jong’s  laboratory,  and  a  PAC  library  with  an 
average  insert  size  of  8  kb  was  prepared  (work  in  Dr.  de  Jong’s  lab  was  funded  by  a  grant  from 
the  Burroughs  Wellcome  Fund).  The  library  was  screened  with  chromosome  2  STS  markers  to 
identify  chromosome  2  clones,  and  the  ends  of  these  clones  were  sequenced.  These  sequences 
were  compared  with  the  known  chromosome  2  sequence,  and  62  clones  were  identified  that  had 
90%  sequence  identity  to  chromosome  2  over  100  bp.  Inspection  of  the  data  suggested  that  no 
rearrangements  had  occurred  in  these  clones,  but  this  conclusion  must  be  confirmed  by  further 
laboratory analyses. Other studies are underway in Dr. de Jong’s lab to determine whether the
insert  sizes  predicted  from  the  sequence  agree  with  those  measured  experimentally,  and  whether 
the  clones  are  stably  maintained  through  extended  culture  periods.  If  these  clones  are  stable,  the 
PAC  library  will  be  useful  for  generation  of  end  sequences  for  linking  of  contig  groups. 

Sequencing  of  P.  falciparum  chromosome  14  (Specific  Aim  1) 


Sequencing  of  chromosome  14  is  being  funded  primarily  by  a  grant  from  the  Burroughs 
Wellcome  Fund;  funds  from  this  collaborative  agreement  are  being  used  to  accelerate  the 
sequencing,  assist  in  closure  and  annotation,  develop  microarrays  for  chromosome  14,  and 
facilitate  rapid  utilization  of  the  sequence  data  by  the  DoD  vaccine  and  drug  development 
groups. 

Library  preparation. 

Preparation  of  a  high-quality  library  of  sheared  DNA  fragments  is  essential  for  successful 
shotgun  sequencing  of  microbial  genomes.  The  strategy  adopted  for  sequencing  of  the 
Plasmodium  genome  involves  purification  of  chromosomal  DNA  on  pulsed  field  gels,  followed 
by  construction  of  shotgun  libraries  and  random  sequencing.  After  determination  of  the 
appropriate  electrophoresis  conditions,  chromosome  14  DNA  was  purified  by  two  rounds  of 
preparative pulsed field gel electrophoresis. The first round was performed in 0.8% chromosomal
grade agarose (Figure 4), and the second round was in 0.8% LMP agarose. Conditions were: 1x
TAE, 500 sec switch time at a field angle of 106 degrees, 3 V/cm for 48 hours at 14 °C. The
agarose slices from 10 pulsed field gels (20 48-hour runs!) were equilibrated in 0.3 M sodium
acetate, melted at 70 °C, and digested with agarase. The digested agarose was extracted with
phenol  and  the  DNA  was  precipitated  with  ethanol.  After  shearing  in  a  nebulizer,  a  shotgun 
library  was  prepared  in  pUC18  using  the  v+i  method  described  previously  55.  Two  libraries  were 
prepared with insert sizes of 1.2-1.6 kb and 1.6-2.0 kb; each library contained 1.6 x 10^7
recombinants. 

High-throughput sequencing.

During  the  chromosome  2  project  we  discovered  that  dye-primer  chemistry  produced 
frequent  artifacts  with  very  AT-rich  templates,  but  that  dichlororhodamine  FS+  dye  terminator 
chemistry  gave  fewer  artifacts  and  longer  read  lengths.  Consequently,  all  reactions  for 
chromosome  14  were  performed  with  the  dichlororhodamine  FS+  dye  terminator  chemistry. 
Currently,  the  success  rate  for  chromosome  14  sequencing  reactions  is  72%  and  the  average  read 
length  is  532  nt;  the  figures  for  chromosome  2  were  66%  and  507  nt,  respectively.  Furthermore, 
inspection  of  the  electropherograms  indicates  that  high  quality  sequences  are  being  obtained,  and 
there  are  many  fewer  indications  of  the  artifacts  in  AT-rich  areas  that  were  problematic  with 
chromosome  2.  The  artifacts  in  dye  primer  reads  were  a  major  impediment  to  assembly  and  gap 
closure  of  chromosome  2;  the  lack  of  such  artifacts  in  the  chromosome  14  data  implies  that 
closure  of  chromosome  14  should  be  a  simpler  process. 

Early  in  the  random  sequencing  phase,  test  assemblies  of  the  chromosome  14  sequences 
were  performed  to  assess  how  closely  they  compared  to  theoretical  predictions  for  a  completely 
random  library.  The  data  indicate  that  there  are  no  major  problems  (i.e.  excessive  vector 
contamination,  pileups  due  to  gross  non-randomness,  etc.)  with  the  two  chromosome  14  libraries 
being sequenced. However, contrary to expectations, there seems to be about 15%
cross-contamination of the chromosome 14 DNA with DNA from other chromosomes. This is similar
to  what  was  found  with  chromosome  2,  and  indicates  that  the  second  round  of  gel-purification 
did  not  reduce  cross-contamination  significantly. 

Closure  of  sequence  and  physical  gaps,  and  validation  of  the  assembly. 

The  random  sequencing  phase  was  completed  in  December  1998.  74,292  sequence  reads, 
equivalent  to  10X  coverage,  were  obtained.  An  assembly  was  performed  with  the 
TIGR  Assembler  software  that  had  been  modified  to  assemble  AT-  and  repeat-rich  P. 
falciparum  sequences.  This  produced  1750  contigs  and  3397  singletons.  The  contigs  were 
released  to  the  public  on  the  TIGR  web  site  (http://www.tigr.org/tdb/mdb/pfdb/pfdb.html)  in 
December 1998. The program grouper was then used to group neighboring contigs and identify
physical  and  sequence  gaps  in  the  sequence.  Grouper  produced  458  groups  of  contigs  totaling  5.5 
Mb.  As  with  the  chromosome  2  project,  we  know  that  the  chromosome  14  library  contained 
sequences  from  other  chromosomes.  Since  chromosome  14  is  approx.  3.4  Mb  in  length,  and  the 
groups  total  5.5  Mb,  approx.  2  Mb  of  the  sequence  may  not  be  from  chromosome  14. 
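
The coverage and contamination figures above can be cross-checked with simple arithmetic. The sketch below is a rough consistency check, not the calculation actually performed at TIGR: it assumes that the 532 nt average read length and the ~15% cross-contamination estimate quoted earlier apply to the full set of 74,292 reads, and the function name is hypothetical.

```python
def fold_coverage(n_reads, mean_read_length_nt, target_size_bp,
                  contaminant_fraction=0.0):
    """Average sequence coverage of the target, discounting contaminating reads."""
    usable_reads = n_reads * (1.0 - contaminant_fraction)
    return usable_reads * mean_read_length_nt / target_size_bp


# Figures quoted in this report: 74,292 reads, 532 nt average read length,
# chromosome 14 ~3.4 Mb, ~15% cross-contamination (assumed to hold throughout).
print(round(fold_coverage(74292, 532, 3.4e6, 0.15), 1))  # -> 9.9

# Assembled sequence in excess of the chromosome size, in Mb.
print(round(5.5 - 3.4, 1))  # -> 2.1
```

Under these assumptions the result is close to the 10X coverage stated above, and the 2.1 Mb excess matches the estimate of roughly 2 Mb of sequence from other chromosomes.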

Closure  efforts  will  focus  on  those  groups  of  contigs  proven  to  be  on  chromosome  14  by 
comparison to a set of chromosome 14 sequence markers, and on groups larger than 5 kb. The
contigs  from  the  assembly  were  compared  to  more  than  70  chromosome  14  markers,  which 
identified  135  contigs  and  63  contig  groups  totaling  2.99  Mb,  or  88%  of  the  chromosome.  With 
chromosome  2,  groups  of  contigs  were  ordered  on  the  chromosome  by  alignment  of  the  contigs 
with  published  STS  markers.  A  similar  strategy  will  be  used  with  chromosome  14.  In  addition, 
BamHI and NheI optical restriction maps of chromosome 14 have been prepared in David
Schwartz’s  laboratory  at  NYU.  Optical  restriction  maps  of  chromosome  2  were  used  as  an 
independent  verification  of  the  final  assembly.  For  chromosome  14,  optical  maps  will  be  used 
during  the  closure  process  to  assist  in  ordering  of  contigs  on  the  chromosome  and  sizing  of  the 
physical  gaps.  Thus  the  information  provided  by  grouper,  the  optical  maps,  and  comparison  of 
contigs  to  the  chromosome  14  STS  markers  will  be  used  to  map  the  contigs  on  the  chromosome, 
and  the  map  will  serve  as  the  starting  point  for  the  gap  closure  process. 
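
Ordering contig groups from marker hits can be sketched as follows. This is a hypothetical illustration only: the data structures, marker names, and the use of the mean marker position as the sort key are assumptions, and the sketch does not use the grouper or optical map information that the actual procedure combines with the STS data.

```python
def order_groups_by_markers(group_markers, marker_positions):
    """Order contig groups along the chromosome by the mean map position of
    the markers they contain; groups without a mapped marker are returned
    separately so they can be placed by other evidence."""
    placed, unplaced = [], []
    for group, markers in group_markers.items():
        positions = [marker_positions[m] for m in markers if m in marker_positions]
        if positions:
            placed.append((sum(positions) / len(positions), group))
        else:
            unplaced.append(group)
    return [group for _, group in sorted(placed)], sorted(unplaced)


# Invented example: marker positions in kb along the chromosome.
hits = {"g1": ["sts3"], "g2": ["sts1", "sts2"], "g3": []}
positions = {"sts1": 10, "sts2": 30, "sts3": 200}
print(order_groups_by_markers(hits, positions))  # -> (['g2', 'g1'], ['g3'])
```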

The  techniques  to  be  used  for  gap  closure  were  developed  during  the  sequencing  of 
several  microbial  genomes  at  TIGR,  and  were  proven  to  work,  with  some  modifications,  for 
closure  of  chromosome  2.  Many  of  these  procedures  rely  on  retrieval  of  data  from  a  relational 
database  that  contains  all  of  the  shotgun  sequence  and  assembly  data.  For  sequence  gaps  these 
techniques  include  a)  performing  long  gel  runs  on  templates  that  are  near  the  ends  of  contigs  and 
point  into  the  gap,  b)  primer  walking  on  plasmid  templates  that  span  the  gaps,  c)  editing  of 
contig  ends  to  remove  vector  or  low  quality  sequences  that  prevented  contig  merging,  and  d)  for 
very AT-rich gaps that cannot be closed with the other techniques, use of the artificial transposon
AT-2 (3) to insert primer-binding sites into templates spanning the sequence gaps. The
AT-2 transposon was used to close six very AT-rich sequence gaps in chromosome 2 that could
not be closed with the other techniques, and to resolve a long repeat structure within the PfEMP3
gene.  Briefly,  an  in  vitro  transposition  reaction  was  performed  to  randomly  insert  the  AT-2 
transposon  into  plasmid  templates  that  spanned  the  sequence  gap.  Transposon-containing 
subclones  were  identified  by  selection  on  ampicillin  and  trimethoprim,  and  multiple  subclones 
from  each  sequence  gap  were  sequenced  using  transposon-specific  primers.  The  sequences  were 
then  assembled  to  close  the  gaps. 

Physical  gaps  will  be  closed  using  several  techniques.  First,  clones  at  the  ends  of  contigs 
which  have  only  one  good  sequence  in  the  database  (i.e.  forward  or  reverse),  and  for  which  the 
missing  sequence  should  fall  in  the  gap  (the  “missing  mate”)  will  be  identified,  and  the  missing 
sequence reaction for these clones will be repeated. The missing mate sequence will often span
the  physical  gap  by  itself,  or  identify  an  unmapped  contig  which  can  be  used  to  close  the  gap. 
Second,  PCR  using  genomic  DNA  and  primers  complementary  to  the  ends  of  contigs  will  be 
used  to  amplify  fragments  spanning  physical  gaps;  the  fragments  will  be  sequenced  and  the 
sequences  assembled  to  close  the  gaps.  As  with  chromosome  2,  this  process  will  proceed  in  two 
phases.  In  the  first  phase,  primers  from  contigs  predicted  to  be  adjacent  on  the  chromosome  will 
be  used  in  PCR  reactions  with  genomic  DNA.  This  will  enable  rapid  closure  of  the  small 
physical  gaps  with  a  minimum  number  of  PCR  reactions.  In  the  second  phase,  primers  from  the 
mapped  contigs  bordering  the  remaining  physical  gaps  will  be  used  in  PCR  reactions  with 
primers  from  the  ends  of  unmapped  contigs  larger  than  2.5  kb.  This  will  identify  unmapped 
contigs  that  fall  within  the  larger  physical  gaps  between  the  mapped  contigs.  PCR  conditions  for 
these reactions were optimized during the chromosome 2 project (long-range PCR with a 60 °C
annealing temperature).
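
The two PCR phases described above differ mainly in how primer pairs are chosen, which can be sketched as follows. The contig names, the "left_end"/"right_end" labels, and the function names are hypothetical and serve only to illustrate the pairing logic.

```python
from itertools import product


def phase_one_pairs(ordered_contigs):
    """Primer pairs across each gap between contigs predicted to be adjacent."""
    return [(left + ":right_end", right + ":left_end")
            for left, right in zip(ordered_contigs, ordered_contigs[1:])]


def phase_two_pairs(open_gap_ends, unmapped_contigs, min_length_bp=2500):
    """Combine each mapped contig end bordering an open gap with both ends of
    every unmapped contig longer than min_length_bp."""
    candidates = sorted(name for name, length in unmapped_contigs.items()
                        if length > min_length_bp)
    unmapped_ends = [f"{name}:{end}" for name in candidates
                     for end in ("left_end", "right_end")]
    return list(product(open_gap_ends, unmapped_ends))


print(phase_one_pairs(["c1", "c2", "c3"]))
print(phase_two_pairs(["c2:right_end"], {"u1": 3000, "u2": 1000}))
```

Phase one needs only one reaction per predicted gap, whereas the number of phase two reactions grows with the product of open gap ends and unmapped contig ends, which is why the cheaper phase is run first.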

The  sequence  will  be  evaluated  with  the  program  check_coverage  to  ensure  that  a)  all 
regions  of  the  assembly  are  covered  by  at  least  two  shotgun  clones,  and  b)  that  every  base  pair  in 
the  sequence  has  been  sequenced  in  both  directions  with  one  chemistry,  or  in  one  direction  with 
two  chemistries.  These  criteria  ensure  that  the  sequence  has  been  assembled  correctly  and 
validate  individual  base  calls.  The  latter  criterion  is  often  satisfied  by  performing  10%  of  the 
sequence  reactions  with  dye-primer  chemistry.  However,  given  the  frequency  of  sequence 
artifacts  in  AT-rich  regions  observed  with  the  dye-primer  chemistry,  this  may  not  be  appropriate 
for  P.  falciparum.  As  we  discovered  with  chromosome  2,  inclusion  of  sequences  containing 
artifacts  in  an  assembly  inhibits  contig  formation  and  increases  the  number  of  sequence  gaps  in 
the  assembly  and  the  effort  required  to  close  them.  Consequently,  all  chromosome  14  sequencing 
is  being  done  with  dye-terminator  chemistry,  and  late  in  the  random  phase  the  coverage  status  of 
the  assembly  will  be  assessed.  Regions  with  one-direction  coverage  will  be  identified,  and 
additional  dye-terminator  reactions  selected  from  the  database  will  be  performed  to  convert  as 
many  as  possible  to  two-direction  coverage.  Regions  with  one-direction  coverage  that  remain 
will  then  be  re-sequenced  with  dye-primer  chemistry.  This  process  will  ensure  that  the  coverage 
criteria  are  satisfied  and  minimize  potential  assembly  problems  arising  from  use  of  dye-primer 
chemistry.  Finally,  the  sequence  will  be  edited  using  the  program  TIGR_Editor,  which  displays 
all  gel  reads  and  electropherograms  for  each  base  in  the  sequence.  Discrepancies  will  be  noted 
and  additional  sequencing  reactions  will  be  performed  to  resolve  ambiguities.  As  a  last  step  to 
confirm  colinearity  of  the  assembled  sequence  and  genomic  DNA,  restriction  maps  predicted 
from  the  sequence  will  be  compared  with  the  chromosome  14  optical  restriction  maps  described 
above. 
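The two coverage criteria can be expressed compactly in code. The sketch below is not the check_coverage program itself, whose internals are not described here; it is a minimal illustration that assumes each read is recorded as a hypothetical tuple of clone identifier, half-open start and end coordinates, strand, and chemistry.

```python
def base_is_validated(evidence):
    """Criterion (b): both directions with one chemistry, or one direction with two."""
    strands_by_chem = {}
    chems_by_strand = {}
    for strand, chem in evidence:
        strands_by_chem.setdefault(chem, set()).add(strand)
        chems_by_strand.setdefault(strand, set()).add(chem)
    both_directions = any(len(s) == 2 for s in strands_by_chem.values())
    two_chemistries = any(len(c) >= 2 for c in chems_by_strand.values())
    return both_directions or two_chemistries


def coverage_report(length, reads):
    """Return (positions covered by fewer than two clones, positions failing criterion b)."""
    clones = [set() for _ in range(length)]
    evidence = [set() for _ in range(length)]
    for clone_id, start, end, strand, chemistry in reads:
        for pos in range(max(start, 0), min(end, length)):
            clones[pos].add(clone_id)
            evidence[pos].add((strand, chemistry))
    low_clone = [p for p in range(length) if len(clones[p]) < 2]
    weak_base = [p for p in range(length) if not base_is_validated(evidence[p])]
    return low_clone, weak_base


# Invented 10-bp example: positions 6-9 are covered by a single clone only.
reads = [
    ("c1", 0, 10, "+", "terminator"),
    ("c2", 0, 6, "-", "terminator"),
    ("c1", 6, 10, "+", "primer"),
]
low_clone, weak_base = coverage_report(10, reads)
```

In this toy case every base satisfies criterion (b), but the last four positions would still be flagged for additional clone coverage.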

(Note: for the sake of clarity, the steps in the closure process were presented in chronological
order. However, to speed closure, several steps can proceed simultaneously; for example,
coverage can be checked and improved before gap closure is finished. By working in parallel,
chromosome 14 can be completed efficiently.)

Annotation. 

Elucidation of gene structure will be performed with the program GlimmerM, a
eukaryotic gene-finding program developed at TIGR specifically for the malaria genome project
(see section above). Before the annotation of chromosome 14 begins, GlimmerM will be refined to
improve accuracy and the training set will be updated with newly published sequences, so that a
more robust gene-finding tool will be available once the sequence is completed. Predicted coding
regions will be searched against the sequence and protein databases using our standard methods.
Repetitive elements and other features will also be identified and annotated. Since many genes
will have no database matches, defining the boundaries of genes will be challenging. Most of the
software necessary for annotation was tested during the chromosome 2 project, and will require
only a few minor modifications for use on chromosome 14. The annotation performed under this
grant will by necessity be preliminary. Our goal is to provide a starting point for further
biological characterization. We will facilitate public access to the sequence by release of
preliminary and finished sequence on the TIGR web site
(http://www.tigr.org/tdb/mdb/pfdb/pfdb.html). This will include full text- and sequence-based
searching of chromosomes 2 and 14, as well as links to other sources of P. falciparum sequence
data such as the Sanger Center and Stanford University. Preliminary contigs were released on the
TIGR web site in December 1998.

Sequencing  of  chromosomes  10  and  11  (Specific  Aim  1) 

Chromosomes 10 and 11 are being sequenced with funding provided by the National
Institute of Allergy and Infectious Diseases. Sequencing of chromosome 11 began in December
1998 and should be completed in 1999, so that closure can begin late in 1999. Raw sequence
reads of chromosome 11 were released on the TIGR web site in January 1999 and will be
updated periodically (http://www.tigr.org/tdb/mdb/pfdb/pfdb.html). Construction of chromosome
10 libraries is underway, and sequencing should begin by mid-1999. Progress on these projects
will be updated in next year’s annual report.

Preparation  of  chromosome  12  and  13  DNA  for  sequencing  (Specific  Aim  1) 

Dr.  Daniel  Carucci  of  NMRC  provided  gel-purified  chromosomal  DNA  to  Richard 
Hyman  at  Stanford  University  (chromosome  12)  and  Daniel  Lawson  at  the  Sanger  Center 
(chromosome  13).  These  investigators  have  used  this  material  for  sequencing  of  chromosomes 
funded  by  the  BWF  and  Wellcome  Trust. 

We  have  consulted  with  Richard  Hyman  and  Eula  Fung  at  Stanford  University,  who  are 
sequencing  chromosome  12,  regarding  the  assembly  and  gap  closure  techniques  we  developed 
during  the  chromosome  2  work.  In  addition,  during  the  chromosome  2  project  we  obtained  many 
sequences  from  other  parts  of  the  genome  due  to  the  co-migration  of  sheared  DNA  from  other 
chromosomes  with  chromosome  2.  Sequences  (4,563)  that  were  not  part  of  the  final  chromosome 
2  assembly  were  provided  to  the  Stanford  group;  these  will  be  examined  to  determine  whether 
they  span  any  gaps  in  the  chromosome  12  sequence. 


Microarray  studies  (added  to  Specific  Aim  1) 


Pilot studies conducted at NMRC in conjunction with TIGR have been initiated to
develop DNA microarray technology for the functional analysis of the P. falciparum genome.
DNA microarrays can be used to examine the expression patterns of thousands of genes
simultaneously from two or more RNA samples. These RNA samples may be derived from
altered growth conditions, different life stages, or other sources in order to determine the
complement of genes that may be involved in certain regulatory processes. These pilot studies
were carried out using the data generated from the chromosome 2 project.

Chromosome 2-specific DNA microarrays were constructed by identifying individual
open reading frames (ORFs) from each of the predicted genes. Forward and reverse
oligonucleotide primer sequences were designed using the computer program “Primer 3” for at
least one ORF from each of the 209 predicted chromosome 2 genes. In total, 245 primer pairs were
synthesized.

Each ORF was amplified from P. falciparum (clone 3D7) genomic DNA by the
polymerase chain reaction (PCR) and verified by agarose gel electrophoresis. The PCR products
were purified and resuspended in DMSO prior to arraying. The DNA microarrays were prepared
using the Molecular Dynamics Arrayer robot to spot less than 1 nanoliter of each amplified
product onto the surface of prepared glass slides. Several test arrays were made using
commercially available silanated glass slides or internally prepared poly-L-lysine slides. In
general, the poly-L-lysine slides gave the best signal-to-noise ratio.

Total RNA was prepared from cultured P. falciparum (clone 3D7) taken at several time
points. The cDNA from each RNA sample was differentially labelled with either dUTP-Cy3 or
dUTP-Cy5 and hybridized to a DNA microarray at 65 °C overnight. After several washes the
DNA microarrays were scanned using a ScanArray® 3000 dual color confocal laser system.
Fluorescence intensity measurements of each spot were made using the computer program
ImaGene™. An example of one experiment comparing the gene expression of two stages of P.
falciparum blood stage development is shown in Figure 5. Studies are now underway to expand
the chromosome-specific DNA microarrays to the nearly completed chromosome 3 (Sanger) and
additional chromosome projects as the data become available. In addition, a Sybase relational
database is being developed to accommodate the vast quantities of DNA microarray data that
will be generated from this project. Sybase was chosen as the DNA microarray relational
database so as to provide a seamless integration of data generated from the chromosome 2, 10,
11 and 14 projects at TIGR. A Web interface has also been developed to facilitate data entry and
data tracking.
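A common way to summarize two-color spot measurements such as these is a background-corrected log2 ratio of the two channels. The sketch below is an assumption about how such a comparison could be computed, not a description of the ImaGene™ output or of the analysis used in this project; the intensities and the function `log2_ratios` are invented for illustration.

```python
import math


def log2_ratios(spots, floor=1.0):
    """Return log2(Cy5 / Cy3) per spot after background subtraction.

    spots maps a gene ID to (cy3_signal, cy3_background, cy5_signal, cy5_background).
    Corrected intensities are clamped to `floor` so the ratio is always defined.
    """
    ratios = {}
    for gene, (cy3, cy3_bg, cy5, cy5_bg) in spots.items():
        green = max(cy3 - cy3_bg, floor)
        red = max(cy5 - cy5_bg, floor)
        ratios[gene] = math.log2(red / green)
    return ratios


# Hypothetical intensities for two chromosome 2 spots (values are not real data).
spots = {
    "PFB0300c": (1100.0, 100.0, 4100.0, 100.0),
    "PFB0405w": (2100.0, 100.0, 600.0, 100.0),
}
ratios = log2_ratios(spots)
```

With the labeling shown in Figure 5 (Cy3 schizont, Cy5 ring stage), a positive value would indicate higher signal in the ring-stage sample and a negative value higher signal in the schizont sample.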


Conclusions 

The  objectives  of  this  5-year  Cooperative  Agreement  between  TIGR  and  the  USAMRMC 
were  to:  Specific  Aim  1,  sequence  3.5  Mb  of  P.  falciparum  genomic  DNA; 
Specific  Aim  2,  annotate  the  sequence;  Specific  Aim  3,  release  the  information  to  the  scientific 
community.  Excellent  progress  was  made  towards  achievement  of  these  goals.  The  complete 
sequence  of  P.  falciparum  chromosome  2  (1  Mb)  was  determined,  published  in  Science,  and 
released  on  the  TIGR  web  site  (http://www.tigr.org/tdb/mdb/pfdb/pfdb.html).  This  is  the  first 
malaria  chromosome  to  be  sequenced  by  the  Malaria  Genome  Sequencing  Consortium.  Many 
techniques  were  developed  that  will  facilitate  sequencing  of  the  AT-rich  P.  falciparum  genome, 
including:  modification  of  the  sequencing  chemistry;  development  of  assembly  software  and  gap 
closure  methods  for  AT-rich  DNA;  development  of  new  gene  finding  software,  GlimmerM; 
construction of a chromosome 2 YAC map and P. falciparum PAC libraries; and initiation of
microarray studies to examine expression of hundreds of genes. The success of this project
demonstrates that the extreme AT-richness of the DNA will not prevent sequencing of the entire
genome. Malaria researchers will be able to apply this information to the study of Plasmodium
biology and to development of new drugs and vaccines against malaria.

Table 1. Summary of features of P. falciparum chromosome 2 and comparison to S.
cerevisiae chromosome 3.

*ND, not determined.

†Protein structural features were predicted as described (11).

Description                                                           Number
                                                                      P. f. chr 2   S. c. chr 3

Chromosome length (kb)                                                945           315
G + C content (%)                                                     19.7          38.6
    Exons                                                             24.3          40.0
    Introns                                                           13.3          ND*
Kilobases per gene                                                    4.50          1.73
Number of predicted protein coding regions                            209           171
Number of genes with introns (%)                                      90 (43)       4 (2.2)
tRNA genes                                                            1             10

Class of proteins†
    Total                                                             209           171
    Secreted (%)                                                      22 (11)       11 (6)
    Integral membrane (%)                                             90 (43)       42 (24)
    Integral membrane with multiple
      predicted transmembrane domains (%)                             27 (13)       21 (12)
    Containing coiled-coil domains (%)                                111 (53)      32 (19)
    Containing other large compositionally biased
      regions with predicted non-globular structure (%)               155 (75)      71 (41)
    Completely non-globular (%)                                       17 (8)        6 (3.5)
    With detectable homologs in other species                         87 (42)       145 (85)

Table 2. Identification of genes on P. falciparum chromosome 2.

PF#, systematic name assigned according to a method adapted from S. cerevisiae (11).

Description: Name, if known, and prominent features of the gene.

Abbreviations are as follows: euk, eukaryotic; nt, nucleotide; OO, organellar origin; prt, protein; TP, transit peptide.

PF#       Description

Amino acid biosynthesis
PFB0200c  aspartate aminotransferase

Biosynthesis of cofactors, prosthetic groups, and carriers
PFB0130w  prenyl transferase
PFB0220w  ubiquinone biosynthesis methyltrans.

Fatty acid and phospholipid metabolism
PFB0385w  acyl-carrier protein
PFB0410c  phospholipase A2-like a/b fold hydrolase
PFB0505c  3-ketoacyl carr. prt synthase III, FabH (OO, TP)
PFB0685c  ATP-dept. acyl-CoA synthetase (TP)
PFB0695c  ATP-dept. acyl-CoA synthetase (TP)

Purines, pyrimidines, nucleosides, and nucleotides
PFB0295w  adenylosuccinate lyase (OO)

DNA metabolism
PFB0160w  ERCC1-like excision repair prt
PFB0180w  prt with 5'-3' exonucl. domain (OO, TP)
PFB0205c  prt with 5'-3' exonucl. domain (Kem-1 family)
PFB0265c  RAD2 endonucl.
PFB0440c  chromatinic RING finger prt, DRING ortholog
PFB0720c  ori. recognition cmplx subunit 5 (ATPase)
PFB0730w  BRAHMA ortholog (DNA helicase superfamily II)
PFB0840w  replication factor C, 40 kDa subunit (replication activat.)
PFB0875c  chromatin-binding prt (SKI/SNW family)
PFB0895c  replication factor C, 140 kDa subunit (ATPase)

Energy metabolism
PFB0795w  ATP synthase alpha chain
PFB0880w  FAD-dependent oxidoreductase (OO)

Transcription
PFB0140w  metal binding prt (DHHC domain)
PFB0175c  prt of the MAK16 family
PFB0215c  prt with Egl-like 3'-5' exonucl. domain
PFB0245c  RNA polymerase 16kD subunit, RPB4-like
PFB0255w  RRM type RNA binding prt
PFB0290c  Zn-ribbon transcription factor (TFIIS family)
PFB0370c  RNA-binding prt (KH domain)
PFB0445c  eIF-4A-like DEAD family RNA helicase
PFB0620w  YOU2-like small euk. C2C2 zinc finger prt
PFB0715w  DNA-directed RNA polymerase subunit 2
PFB0725c  metal binding prt (DHHC domain)
PFB0855c  rRNA methylase (SpoU family) (OO, TP)
PFB0860c  RNA helicase
PFB0865w  small nuclear ribonucleoprt. (SNRNP family)
PFB0890c  pseudouridine synthet. (RsuA fam.); 1st euk. member (OO)

Translation and post-translational modification
PFB0165w  tRNA-Glu
PFB0240w  PINT domain prt (proteasomal subunit)
PFB0260w  PSD2-like 26S proteasomal subunit
PFB0325c  SERA antigen/protease with active Cys
PFB0330c  SERA antigen/protease with active Cys
PFB0335c  SERA antigen/protease with active Cys
PFB0340c  SERA antigen/protease with active Ser
PFB0345c  SERA antigen/protease with active Ser
PFB0350c  SERA antigen/protease with active Ser
PFB0355c  SERA antigen/protease with active Ser
PFB0360c  SERA antigen/protease with active Ser
PFB0380c  phosphatase (acid phosphatase family)
PFB0390w  ribosome releasing factor (OO, TP)
PFB0455w  ribosomal prt L37A
PFB0515w  glycosyl transferase (novel euk. family)
PFB0525w  asparaginyl-tRNA synthetase (OO, TP)
PFB0545c  ribosomal prt L7/L12 (OO)
PFB0550w  euk. peptide chain release factor
PFB0585w  Leu/Phe-tRNA prt transferase, 1st euk. member (OO)
PFB0645c  ribosomal prt L13 (OO)
PFB0830w  ribosomal prt S26
PFB0885w  ribosomal prt S30

Regulatory functions
PFB0150c  Ser/Thr prt kinase
PFB0510w  GAF domain prt (cyclic nt signal transduct.)
PFB0520w  novel prt kinase
PFB0605w  Ser/Thr prt kinase
PFB0665w  Ser/Thr prt kinase
PFB0815w  calcium-dept. prt kinase (C-term. EF hand)

Transport
PFB0210c  monosaccharide transporter
PFB0275w  membrane transporter
PFB0435c  predicted amine transporter
PFB0465c  membrane transporter

Cell surface
PFB0010w  var gene
PFB0015c  rifin
PFB0020c  var gene fragment
PFB0025c  rifin
PFB0030c  rifin
PFB0035c  rifin
PFB0040c  rifin
PFB0045c  var gene fragment
PFB0050c  rifin pseudogene
PFB0055c  rifin
PFB0060w  rifin
PFB0065w  rifin
PFB0100c  knob-associated His-rich prt
PFB0300c  merozoite surface antigen MSP-2
PFB0305c  merozoite surface antigen MSP-5 (EGF domain)
PFB0310c  merozoite surface antigen MSP-4 (EGF domain)
PFB0400w  PfS230 paralog (predicted secreted prt)
PFB0405w  transmission blocking target antigen PfS230
PFB0570w  predicted secreted prt (thrombospondin domain)
PFB0760w  Mtn3/RAG1IP-like prt
PFB0915w  RESA-H3 antigen
PFB0955w  rifin
PFB0975c  var gene fragment
PFB1000w  rifin pseudogene
PFB1005w  rifin
PFB1010w  rifin
PFB1015w  rifin
PFB1020w  rifin
PFB1025w  var gene fragment
PFB1030w  var gene fragment
PFB1035w  rifin
PFB1040w  rifin
PFB1045w  var gene fragment
PFB1050w  rifin
PFB1055c  var gene

Other cellular processes
PFB0085c  prt with DnaJ domain (RESA-like)
PFB0090c  prt with DnaJ domain
PFB0450w  prt translocation complex, sec61 gamma chain
PFB0480w  syntaxin
PFB0500c  RAB GTPase
PFB0595w  prt with DnaJ domain, DNJ1/SIS1 family
PFB0635w  T-complex prt 1 (HSP60 fold superfamily)
PFB0640c  WEB-1 ortholog, WD40
PFB0750w  VPS45-like prt (STXBP/UNC-18/SEC1 family)
PFB0805c  clathrin coat assembly prt
PFB0920w  prt with DnaJ domain (RESA-like)
PFB0925w  prt with DnaJ domain (RESA-like)

Unknown function
PFB0270w  member family of bacterial prts (OO)
PFB0320c  member hesB fam. (poss. redox activity, OO, TP)
PFB0420w  YgbB prt, 1st euk. member (OO, TP)
PFB0425c  prt of the YMR7 family

Figure 1 Legend. Gene map of P. falciparum chromosome 2.

Predicted  coding  regions  are  shown  on  each  strand.  Exons  of  protein  encoding  genes  are 
indicated  by  rectangles  with  lines  linking  rectangles  representing  introns.  The  single  tRNAGlu 
gene  is  indicated  by  a  cloverleaf  structure.  Genes  are  color-coded  according  to  broad  role 
categories  as  shown  in  the  key.  Gene  identification  numbers  correspond  to  those  in  Table  2.  The 
letters CC and NG followed by numerals indicate the number of predicted coiled-coil and
non-globular domains in the proteins, respectively.

Figure 2 Legend. Confirmation of expression of non-globular domains by RT-PCR.

A. Confirmation of expression of non-globular domains and proteins in blood stages by
RT-PCR. Total blood stage RNA (trophozoite or schizont) was amplified by RT-PCR. From left to
right the samples are PFB0180w (5'-3' exonuclease), PFB0130w (prenyl transferase),
PFB0145c (predicted non-globular protein), PFB0265c (RAD2 ortholog), PFB0380c (acid
phosphatase), PFB0435c (putative amine transporter), PFB0500c (RAB GTPase), PFB0520w
(protein kinase), PFB0525w (asparaginyl-tRNA synthetase), PFB0685c (ATP-dependent acyl
CoA synthetase), PFB0720c (origin recognition complex subunit 5), PFB0755w (predicted
non-globular protein), and PFB0220w (ubiquinone biosynthesis methyltransferase). G, T, and S
indicate amplification from 3D7 genomic DNA, trophozoite cDNA, and schizont cDNA,
respectively. The PFB0220w gene contains an intron and was used as a control; the primers for
this gene spanned the intron, and the smaller size of the products obtained from blood stage RNA
confirms that products observed in the RT-PCR reactions were due to amplification from RNA.

B. Multiple alignment of the predicted 5'-3' exonuclease (PFB0180w) encoded in
chromosome 2 with homologous bacterial exonuclease domains showing the large non-globular
insert in Plasmodium. The alignment was constructed using the profile alignment option of
CLUSTALW. The alignment column shading is based on a 100% consensus, which is shown
underneath the alignment; h indicates hydrophobic residues (A,C,F,I,L,M,V,W,Y; yellow
background), u indicates "tiny" residues (G, A, S; green background), o indicates hydroxy
residues (S, T), c indicates charged residues (D,E,K,R,H), and "+" indicates positively charged
residues (K,R; purple coloring). The aspartates involved in metal coordination are shown by red
background and inverse type. Secondary structure elements derived from the crystal structure of
Thermus aquaticus DNA polymerase (57) are shown above the alignment (H indicates α-helix, and
E indicates extended conformation, or β-strand). 5'-3-exo_Aae is a stand-alone exonuclease from
Aquifex aeolicus, and the remaining bacterial sequences are the N-terminal domains of DNA
polymerase I.

Figure 2A.

Figure 3 Legend. Multiple sequence alignment of rifins encoded on chromosome 2.

The predicted coding regions were aligned with CLUSTALW (56) using the default
settings. The alignment column shading is based on a 95% consensus, which is shown
underneath the alignment; h indicates hydrophobic residues (A,C,F,I,L,M,V,W,Y; yellow
background), p indicates polar residues (D,E,H,K,N,Q,R,S,T; red coloring), b indicates big
residues (F,I,L,M,V,W,Y,K,R,Q,E; gray background), and "+" indicates positively charged
residues (K,R; red coloring). The cysteines conserved in subsets of rifins are shown by blue
shading and inverse coloring.

Figure 4. First-round purification of chromosome 14.

Figure 5. False color image of a chromosome 2 microarray.

Microarray hybridized to P. falciparum Cy3-labeled schizont cDNA (green) and Cy5-labeled
ring stage cDNA (red).

Chromosome 2 of Plasmodium falciparum was sequenced; this sequence contains
947,103 base pairs and encodes 210 predicted genes. In comparison with
the Saccharomyces cerevisiae genome, chromosome 2 has a lower gene density,
introns are more frequent, and proteins are markedly enriched in nonglobular
domains. A family of surface proteins, rifins, that may play a role in antigenic
variation was identified. The complete sequencing of chromosome 2 has shown
that sequencing of the A+T-rich P. falciparum genome is technically feasible.


Malaria, a disease caused by protozoan parasites of the genus Plasmodium, is one of the
most dangerous infectious diseases affecting human populations. Approximately 300 million
to 500 million people are infected annually, and 1.5 million to 2.7 million lives are
lost to malaria each year, with most deaths occurring among children in sub-Saharan Africa
(1). Of the four species that cause malaria in humans, P. falciparum is the greatest cause
of morbidity and mortality. The resistance of the malaria parasite to drugs and the
resistance of mosquitoes to insecticides have resulted in a resurgence of malaria in many
parts of the world and a pressing need for vaccines and new drugs. The identification of
new targets for vaccine and drug development is dependent on the expansion of our
understanding of parasite biology; this understanding is hampered by the complexity of
the parasite life cycle. The sequencing of the Plasmodium genome may circumvent many
of these difficulties and rapidly increase our knowledge about these parasites.

The P. falciparum genome is ~30 Mb in size; has a base composition of 82% A+T;
and contains 14 chromosomes, which range from 0.65 to 3.4 Mb. Chromosomes from
different wild isolates exhibit extensive size polymorphism. Mapping studies have
indicated that the chromosomes contain central domains that are conserved between isolates
and polymorphic subtelomeric domains that contain repeated sequences. P. falciparum
also contains two organellar genomes. The mitochondrial genome is a 5.9-kb, tandemly
repeated DNA molecule; a 35-kb circular DNA molecule, which encodes genes that are
usually associated with plastid genomes, is located within the apicoplast [an organelle of
uncertain function in Plasmodium and the related parasite Toxoplasma (2)].

Chromosome 2 (GenBank accession number AE001362) was sequenced with the shotgun
sequencing approach, which was previously used to sequence several microbial genomes
(3, 4), with modifications to compensate for the A+T richness of P. falciparum
DNA (5). These modifications included the following: the extraction of DNA from agarose
under high-salt conditions to prevent the DNA from melting at a high temperature, the
avoidance of ultraviolet (UV) light, the use of the “vector plus insert” protocol for library
construction, sequencing with dye-terminator chemistry, the use of a reduced extension
temperature in polymerase chain reactions (PCRs), and the use of a transposon-insertion method
for the closure of gaps that are very rich in AT. The assembly software was also modified to
minimize the misassembly of A+T-rich sequences. The complete sequence included portions
of both telomeres and had an average redundancy of 11-fold; colinearity of the final
sequence and genomic DNA was proven with optical restriction and yeast artificial
chromosome (YAC) maps.
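The colinearity check against an optical restriction map amounts to comparing fragment sizes predicted from the sequence with fragment sizes measured on the map. The sketch below illustrates the idea under stated assumptions: the enzyme (BamHI, recognition site GGATCC, cutting after the first G) is chosen only as an example, the toy sequence and "observed" sizes are invented, and the 10% tolerance is an arbitrary placeholder rather than a value from this work.

```python
def digest_fragments(sequence, site, cut_offset):
    """Return fragment lengths, in order, from a complete digest of a linear sequence."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def maps_agree(predicted, observed, tolerance=0.10):
    """True when both maps have the same fragment count and each size is within tolerance."""
    if len(predicted) != len(observed):
        return False
    return all(abs(p - o) <= tolerance * max(p, 1) for p, o in zip(predicted, observed))


# Toy AT-rich sequence with two GGATCC sites (not real chromosome 2 sequence).
toy = "AT" * 10 + "GGATCC" + "TA" * 15 + "GGATCC" + "AT" * 5
predicted = digest_fragments(toy, "GGATCC", cut_offset=1)
consistent = maps_agree(predicted, [20, 38, 15])
```

A mismatch in fragment count or order would point to a possible misassembly in the corresponding region.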

Chromosome 2 of P. falciparum (clone 3D7) is 947 kb in length and has an overall
base composition of 80.2% A+T. The chromosome contains a large central region
that encodes single-copy genes and several duplicated genes, subtelomeric regions that
contain variant antigen genes (var) (6-8), repetitive interspersed family (RIF)-1
elements (9) and other repeats, and typical eukaryotic telomeres (Fig. 1). The terminal
23-kb portions of the chromosome are noncoding and exhibit 77% identity in opposite
orientations. The left and right telomeres consist of tandem repeats of the sequence
TT(TC)AGGG (10) and total 1141 and 551 nucleotides (nt), respectively. The
subtelomeric regions do not exhibit repeat oligomers until ~12 to 20 kb into the
chromosome, where rep20 (11) (a 21-bp tandem direct repeat found exclusively in these
regions) occurs 134 and 96 times in the left and right ends of the chromosome,
respectively. The sequence similarity that was observed between the subtelomeric regions
supports previous suggestions that recombination between chromosome ends may be
one mechanism by which genetic diversity is generated. A region with centromere
functions could not be identified on the basis of sequence similarity to S. cerevisiae
or other eukaryotic centromeres (12). However, several regions of up to 12 kb are
devoid of large open reading frames (ORFs) and might contain the centromere.
Alternatively, centromeric functions may be defined by higher order DNA structures
and chromatin-associated protein complexes (13).
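The telomere motif TT(TC)AGGG denotes a 7-nt unit whose third base is T or C, so tandem arrays of it can be located with a regular expression. The following is a minimal sketch on an invented sequence; the function name is hypothetical, and only the strand on which the motif reads TT(T/C)AGGG is scanned (the opposite chromosome end would carry the reverse complement).

```python
import re

# One or more tandem copies of TTTAGGG or TTCAGGG.
TELOMERE_RUN = re.compile(r"(?:TT[TC]AGGG)+")


def longest_telomere_run(sequence):
    """Return (number of repeat units, length in nt) of the longest tandem run."""
    runs = (m.group(0) for m in TELOMERE_RUN.finditer(sequence.upper()))
    best = max(runs, key=len, default="")
    return len(best) // 7, len(best)


# Invented example: three tandem units, a spacer, then a single isolated unit.
toy_end = "TTTAGGGTTCAGGGTTTAGGG" + "ATATATAT" + "TTTAGGG"
units, length_nt = longest_telomere_run(toy_end)
```

The same pattern-based approach could be adapted to count rep20 copies once the 21-bp consensus is supplied.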

Fig. 1. Gene map of P. falciparum chromosome 2. Predicted coding regions are shown
on each strand. Exons of protein-encoding genes are indicated by rectangles, and lines
linking rectangles represent introns. The single tRNAGlu gene is indicated by a cloverleaf
structure. Genes are color-coded according to broad role categories as shown in the key.
Gene identification numbers correspond to PF numbers in Table 2. The letters CC, NG, and
TM followed by numerals indicate the number of predicted coiled-coil, nonglobular, and
transmembrane domains in the proteins, respectively.

Two hundred and nine protein-encoding genes and a gene for tRNAGlu (Fig. 1 and
Table 1) were predicted (14) on chromosome 2, giving a gene density of one gene per 4.5
kb, which is a value between that observed in yeast (one gene per 2 kb) and in
Caenorhabditis elegans (one gene per 7 kb). Of the 209 protein-encoding genes, 43%
contain at least one intron. This percentage is an estimate because some introns may have
been missed by the gene-finding method. Most spliced genes consist of two or three exons.
In terms of intron content and gene density, the Plasmodium genome, which was assessed by
the analysis of the first completed chromosome sequence, appears to be intermediate between
the condensed yeast genome and the intron-rich genomes of multicellular eukaryotes.
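The gene density and intron frequency quoted above follow directly from the chromosome length and gene counts. As a quick arithmetic check, using the 947 kb length from the text and the count of 90 intron-containing genes from Table 1:

```python
chromosome_kb = 947          # chromosome 2 length from the text
protein_genes = 209          # predicted protein-encoding genes
trna_genes = 1               # the single tRNA-Glu gene
genes_with_introns = 90      # from Table 1

# Average spacing of all predicted genes along the chromosome.
kb_per_gene = chromosome_kb / (protein_genes + trna_genes)

# Share of protein-encoding genes that contain at least one intron.
intron_percent = 100 * genes_with_introns / protein_genes

print(f"{kb_per_gene:.2f} kb per gene, {intron_percent:.0f}% with introns")
```

Both results round to the reported values of one gene per 4.5 kb and 43%.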

The proteins encoded in chromosome 2 (Table 2) fall into the following three
categories: (i) 72 proteins (34%) are conserved in other genera and contain one or more
distinct globular domains; (ii) 47 proteins (23%) belong to Plasmodium-specific families
with identifiable structural features and, in some cases, known functions; and (iii) 90
predicted proteins (43%) have no detectable homologs, although many contain structural
features such as signal peptides and transmembrane domains. Homologs outside Plasmodium
were detected for 87 (42%) of the 209 predicted proteins. These include proteins in the
first category, in addition to those proteins in the second category that possess a conserved
domain or domains that are arranged in a manner unique to Plasmodium. The percentage
of evolutionarily conserved proteins is about two times lower than that found for
other genomes, mainly because most of the remaining proteins were predicted to consist
primarily of nonglobular domains (15) (Table 1). The abundance of nonglobular domains in
Plasmodium proteins is very unusual; the proportion of proteins with predicted large
nonglobular domains in other eukaryotes, such as S. cerevisiae (Table 1) or C. elegans
(16), is approximately half that observed in Plasmodium. Furthermore, 13 of the 87
conserved proteins on chromosome 2 appear to contain large nonglobular structures (>30
amino acids) that are inserted directly into globular domains, as determined by
alignment with homologs from other species.

To determine whether nonglobular domains and proteins are expressed in
P. falciparum, we performed reverse transcriptase (RT)-PCR on 11
nonglobular domains and on two genes that encoded predominantly
nonglobular proteins, using total blood-stage RNA as a template. In all
cases, RT-PCR products were the same size as those that were amplified
from genomic DNA, and the sequence of RT-PCR products matched the genomic
DNA sequence (17). Thus, it is likely that most, if not all, predicted
nonglobular domains in chromosome 2 genes are expressed. One example of
the insertion of a nonglobular domain into a well-defined globular domain
is seen in a protein containing a 5'-3' exonuclease (Fig. 2). The
alignment of the Plasmodium sequence with four bacterial exonucleases
revealed a 176-amino acid insertion in a region between a strand and a
helix in the three-dimensional structure of this protein (18). This
suggests that eukaryotic proteins can accommodate inserts that may be
excluded from the protein core folding without impairing the protein
function. The propagation of nonglobular domains in Plasmodium suggests
that such proteins provide specific selective advantages to the parasite.
A structural analysis of Plasmodium proteins that contain nonglobular
inserts may be valuable for understanding the general principles of
protein folding.
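
The idea of locating an insert by alignment with homologs can be illustrated with a short sketch. It scans a gapped multiple alignment for runs of columns in which the query has residues and every reference row has a gap. The function name, the "-" gap character, and the toy sequences are assumptions made for illustration; this is not the procedure used in the paper.

```python
GAP = "-"


def find_inserts(query_row, reference_rows, min_length=30):
    """Find query-specific inserts in a gapped multiple alignment.

    An insert is a run of columns where the query has a residue and
    every reference row has a gap. Only runs longer than min_length
    residues are reported, as (start, end, length) with 0-based,
    end-exclusive column indices.
    """
    if any(len(row) != len(query_row) for row in reference_rows):
        raise ValueError("all alignment rows must have the same length")

    n = len(query_row)
    inserts = []
    run_start = None
    # Iterate one column past the end so a trailing run is closed.
    for col in range(n + 1):
        in_insert = (
            col < n
            and query_row[col] != GAP
            and all(row[col] == GAP for row in reference_rows)
        )
        if in_insert and run_start is None:
            run_start = col
        elif not in_insert and run_start is not None:
            length = col - run_start
            if length > min_length:
                inserts.append((run_start, col, length))
            run_start = None
    return inserts


# Toy alignment: the query carries a 40-residue run absent from both references.
query = "MKV" + "N" * 40 + "LLE"
references = ["MRV" + "-" * 40 + "LIE", "MKI" + "-" * 40 + "LLD"]
print(find_inserts(query, references))
```

The default threshold mirrors the ">30 amino acids" criterion quoted earlier for large nonglobular structures.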

Of the 87 conserved proteins that are encoded on chromosome 2, 71 (83%)
show the greatest similarity to eukaryotic homologs (Table 2). In
contrast, the remaining 16 proteins are most similar to bacterial
proteins, and 4 of these represent the first eukaryotic members of protein
families that have previously been seen only in bacteria. At least some of
these 16 genes may have been transferred to the nuclear genome from an
organellar genome after the divergence of the phylum Apicomplexa from
other eukaryotic lineages. Several of these proteins appear to contain
NH2-terminal organellar import peptides (19) and may function within the
apicoplast or the mitochondrion. One such gene encodes 3-ketoacyl-acyl
carrier protein (ACP) synthase III (FabH), which catalyzes the
condensation of acetyl-coenzyme A and malonyl-ACP in type II (dissociated)
fatty acid synthase systems. Type II synthase systems are restricted to
bacteria and the plastids of plants, confirming previous hypotheses that
the Plasmodium apicoplast contains metabolic pathways that are distinct
from those of the host (20, 21).

Because the phylum Apicomplexa represents a deep branch in the eukaryotic
tree, the presence of eukaryotic-specific genes in P. falciparum suggests
the appearance of these genes early in eukaryotic evolution. Most of these
genes code for proteins that are involved in DNA replication, repair,
transcription, or translation (Table 2) and include the origin recognition
complex subunit 5, excision repair proteins ERCC1 and RAD2, and proteins
involved in chromatin dynamics (such as the BRAHMA helicase, an ortholog
of the DRING protein containing the RING finger domain, and chromatin
protein SNW1). Furthermore, several eukaryotic proteins involved in
secretion are encoded in chromosome 2 (such as the SEC61 γ subunit, the
coated pit coatomer subunit, and syntaxin), suggesting an early emergence
of the eukaryotic secretory system.

Proteins of the DnaJ superfamily act as cofactors for HSP70-type molecular
chaperones and participate in protein folding and trafficking, complex
assembly, organelle biogenesis, and initiation of translation (22). Five
proteins containing DnaJ domains are present on chromosome 2, which
suggests multiple roles for this domain in the Plasmodium life cycle. Two
of these proteins consist primarily of the DnaJ domain, whereas three of
the five proteins also contain a large nonglobular domain. Several
proteins containing a DnaJ domain have been detected on other chromosomes,
indicating that this is a large gene family in Plasmodium (23). One of its
members, the ring-infected erythrocyte surface antigen, binds to the
cytoplasmic side of the erythrocyte membrane, suggesting that DnaJ domains
perform chaperone-like functions in the formation of protein complexes at
this location (24). DnaJ domains in some P.


Table 1. Summary of features of P. falciparum chromosome 2 (P. f. chr 2)
and comparison to S. cerevisiae chromosome 3 (S. c. chr 3). Protein
structural features were predicted as described (74). ND, not determined.
Numbers in parentheses indicate the percentage of the total genes or
proteins with the specified properties.

                                                        Number
Description                                             P. f. chr 2   S. c. chr 3

Chromosome length (kb)                                  947           315
Percent G+C content                                     19.7          38.6
  Exons                                                 24.3          40.0
  Introns                                               13.3          ND
Kilobases per gene                                      4.50          1.73
Number of predicted protein-coding regions              209           171
Number of genes with introns (%)                        90 (43)       4 (2.2)
tRNA genes                                              1             10

Class of proteins
  Total                                                 209           171
  Secreted (%)                                          22 (11)       11 (6)
  Integral membrane (%)                                 90 (43)       42 (24)
  Integral membrane with multiple predicted
    transmembrane domains (%)                           27 (13)       21 (12)
  Containing coiled-coil domains (%)                    111 (53)      32 (19)
  Containing other large compositionally biased
    regions with predicted nonglobular structure (%)    155 (74)      71 (41)
  Completely nonglobular (%)                            17 (8)        6 (3.5)
  With detectable homologs in other species             87 (42)       145 (85)
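
The G+C percentages in Table 1 are base-composition ratios. A minimal sketch of that calculation follows; the function name and the choice to ignore ambiguous bases are assumptions made here, not details taken from the paper.

```python
def gc_percent(sequence):
    """Percent G+C of a DNA sequence, ignoring characters other than A, C, G, T."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / total


# Toy A+T-rich sequence (not real Plasmodium data).
print(gc_percent("ATATATATGC"))
```

Applied separately to exon and intron sequences, the same function would give per-feature values of the kind listed in the table.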



6  NOVEMBER  1998  VOL  282  SCIENCE  www.sciencemag.org 


Table 2. Identification of genes on P. falciparum chromosome 2. The PF
number is the systematic name assigned according to a method adapted from
S. cerevisiae (14). The description contains the name (if known) and
prominent features of the gene. The table includes genes with homologs in
other species and members of Plasmodium gene families. An expanded version
of this table with additional information is available on the World Wide
Web at www.tigr.org/tdb/mdb/pfdb/pfdb.html. Prt, protein; OO, organellar
origin; TP, transit peptide; ATP, adenosine triphosphate; euk.,
eukaryotic; nt, nucleotide.

PF number   Description

Amino acid biosynthesis
PFB0200c    Aspartate aminotransferase

Biosynthesis of cofactors, prosthetic groups, and carriers
PFB0130w    Prenyl transferase
PFB0220w    Ubiquinone biosynthesis methyltransferase

Fatty acid and phospholipid metabolism
PFB0385w    Acyl-carrier prt
PFB0410c    Phospholipase A2-like α/β fold hydrolase
PFB0505c    3-ketoacyl carrier prt synthase III, FabH (OO, TP)
PFB0685c    ATP-dependent acyl-CoA synthetase (TP)
PFB0695c    ATP-dependent acyl-CoA synthetase (TP)

Purines, pyrimidines, nucleosides, and nucleotides
PFB0295w    Adenylosuccinate lyase (OO)

DNA metabolism
PFB0160w    ERCC1-like excision repair prt
PFB0180w    Prt with 5'-3' exonuclease domain (OO, TP)
PFB0205c    Prt with 5'-3' exonuclease domain (Kem-1 family)
PFB0265c    RAD2 endonuclease
PFB0440c    Chromatinic RING finger prt, DRING ortholog
PFB0720c    Origin recognition complex subunit 5 (ATPase)
PFB0730w    BRAHMA ortholog (DNA helicase superfamily II)
PFB0840w    Replication factor C, 40-kDa subunit (replication activator)
PFB0875c    Chromatin-binding prt (SKI/SNW family)
PFB0895c    Replication factor C, 140-kDa subunit (ATPase)

Energy metabolism
PFB0795w    ATP synthase alpha chain
PFB0880w    FAD-dependent oxidoreductase (OO)

Transcription
PFB0140w    Metal-binding prt (DHHC domain)
PFB0175c    Prt of the MAK16 family
PFB0215c    Prt with Egl-like 3'-5' exonuclease domain
PFB0245c    RNA polymerase 16-kD subunit, RPB4-like
PFB0255w    RRM-type RNA-binding prt
PFB0290c    Zn-ribbon transcription factor (TFIIS family)
PFB0370c    RNA-binding prt (KH domain)
PFB0445c    eIF-4A-like DEAD family RNA helicase
PFB0620w    YOU2-like small euk. C2C2 Zn finger prt
PFB0715w    DNA-directed RNA polymerase subunit 2
PFB0725c    Metal-binding prt (DHHC domain)
PFB0855c    rRNA methylase (SpoU family) (OO, TP)
PFB0860c    RNA helicase
PFB0865w    Small nuclear ribonucleoprt. (SNRNP family)
PFB0890c    Pseudouridine synthetase (RsuA family); first euk. member (OO)

Translation and post-translational modification
PFB0165w    tRNA-Glu
PFB0240w    PINT domain prt (proteasomal subunit)
PFB0260w    PSD2-like 26S proteasomal subunit
PFB0325c    SERA antigen/protease with active Cys
PFB0330c    SERA antigen/protease with active Cys
PFB0335c    SERA antigen/protease with active Cys
PFB0340c    SERA antigen/protease with active Ser
PFB0345c    SERA antigen/protease with active Ser
PFB0350c    SERA antigen/protease with active Ser
PFB0355c    SERA antigen/protease with active Ser
PFB0360c    SERA antigen/protease with active Ser
PFB0380c    Phosphatase (acid phosphatase family)
PFB0390w    Ribosome releasing factor (OO, TP)
PFB0455w    Ribosomal prt L37A
PFB0515w    Glycosyl transferase (novel euk. family)
PFB0525w    Asparaginyl-tRNA synthetase (OO, TP)
PFB0545c    Ribosomal prt L7/L12 (OO)
PFB0550w    Euk. peptide chain release factor
PFB0585w    Leu/Phe-tRNA prt transferase, first euk. member (OO)
PFB0645c    Ribosomal prt L13 (OO)
PFB0830w    Ribosomal prt S26
PFB0885w    Ribosomal prt S30

Regulatory functions
PFB0150c    Ser/Thr prt kinase
PFB0510w    GAF domain prt (cyclic nt signal transduction)
PFB0520w    Novel prt kinase
PFB0605w    Ser/Thr prt kinase
PFB0665w    Ser/Thr prt kinase
PFB0815w    Calcium-dependent prt kinase (C-terminus EF hand)

Transport
PFB0210c    Monosaccharide transporter
PFB0275w    Membrane transporter
PFB0435c    Predicted amino transporter
PFB0465c    Membrane transporter

Cell surface
PFB0010w    var gene
PFB0015c    Rifin
PFB0020c    var gene fragment
PFB0025c    Rifin
PFB0030c    Rifin
PFB0035c    Rifin
PFB0040c    Rifin
PFB0045c    var gene fragment
PFB0050c    Rifin pseudogene
PFB0055c    Rifin
PFB0060w    Rifin
PFB0065w    Rifin
PFB0100c    Knob-associated His-rich prt
PFB0300c    Merozoite surface antigen MSP-2
PFB0305c    Merozoite surface antigen MSP-5 (EGF domain)
PFB0310c    Merozoite surface antigen MSP-4 (EGF domain)
PFB0400w    Pfs230 paralog (predicted secreted prt)
PFB0405w    Transmission-blocking target antigen Pfs230
PFB0570w    Predicted secreted prt (thrombospondin domain)
PFB0760w    Mtn3/RAG1IP-like prt
PFB0915w    RESA-H3 antigen
PFB0955w    Rifin
PFB0975c    var gene fragment
PFB1000w    Rifin pseudogene
PFB1005w    Rifin
PFB1010w    Rifin
PFB1015w    Rifin
PFB1020w    Rifin
PFB1025w    var gene fragment
PFB1030w    var gene fragment
PFB1035w    Rifin
PFB1040w    Rifin
PFB1045w    var gene fragment
PFB1050w    Rifin
PFB1055c    var gene

Other cellular processes
PFB0085c    Prt with DnaJ domain (RESA-like)
PFB0090c    Prt with DnaJ domain
PFB0450w    Prt translocation complex, SEC61 γ chain
PFB0480w    Syntaxin
PFB0500c    RAB GTPase
PFB0595w    Prt with DnaJ domain, DNJ1/SIS1 family
PFB0635w    T-complex prt 1 (HSP60 fold superfamily)
PFB0640c    WEB-1 ortholog, WD40
PFB0750w    VPS45-like prt (STXBP/UNC-18/SEC1 family)
PFB0805c    Clathrin coat assembly prt
PFB0920w    Prt with DnaJ domain (RESA-like)
PFB0925w    Prt with DnaJ domain (RESA-like)

Unknown function
PFB0270w    SLR1419 family prt (OO)
PFB0320c    HesB family prt (possible redox activity, OO, TP)
PFB0420w    YgdB prt, first euk. member (OO, TP)
PFB0425c    YMR7 family prt
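
The systematic PF names in Table 2 follow a regular pattern that can be parsed mechanically. The sketch below assumes, by analogy with the yeast convention the caption mentions, that a name consists of "PF", a chromosome letter (A for chromosome 1, B for chromosome 2, and so on), a four-digit number, and a final "w" or "c" indicating the strand. That layout, the A-to-N letter range, and the function name are assumptions for illustration and are not spelled out in this section.

```python
import re

# Assumed layout: "PF", chromosome letter (A = 1, B = 2, ...),
# four digits, then "w" or "c" for the coding strand.
PF_NAME = re.compile(r"^PF([A-N])(\d{4})([wc])$")


def parse_pf_name(name):
    """Split a systematic name such as 'PFB0180w' into its parts."""
    match = PF_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized PF name: {name!r}")
    letter, number, strand = match.groups()
    return {
        "chromosome": ord(letter) - ord("A") + 1,
        "number": int(number),
        "strand": strand,
    }


print(parse_pf_name("PFB0180w"))
```

A strict pattern like this also rejects names in which digits have been misread as letters, which is a common failure when such tables are scanned.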



[Fig. 2 alignment: PFB0180w aligned with DPO1_THEAQ (118828), 5'-3'-exo_Aae
(2983968), DPO1_BACCA (416913), and DPO1_ECOLI (118825), with a 100%
consensus line; labeled regions: non-globular insert, helix-hairpin-helix
domain.]

Fig. 2. Multiple alignment of the predicted 5'-3' exonuclease (PFB0180w)
encoded in chromosome 2 with homologous bacterial exonuclease domains
showing the large nonglobular insert in Plasmodium. The alignment was
constructed with the profile alignment option of CLUSTALW (34). The
alignment column shading is based on a 100% consensus, which is shown
underneath the alignment; h indicates hydrophobic residues (A, C, F, I, L,
M, V, W, and Y), u indicates "tiny" residues (G, A, and S), o indicates
hydroxy residues (S and T), c indicates charged residues (D, E, K, R, and
H), and + indicates positively charged residues (K and R) (35). The
aspartates involved in metal coordination have a red background and
inverse type. Secondary structure elements derived from the crystal
structure of Thermus aquaticus DNA polymerase (18) are shown above the
alignment (H indicates a helix, and E indicates extended conformation, or
β strand). 5'-3'-exo_Aae is a stand-alone exonuclease from Aquifex
aeolicus, and the remaining bacterial sequences are the NH2-terminal
domains of DNA polymerase I.

falciparum proteins contain substitutions in the His-Pro-Asp signature
that is required for interaction with HSP70-type proteins, which may
indicate a modification of the typical chaperone function.

Chromosome 2 contains five protein families that are unique to Plasmodium
in terms of their distinct domain organization, although three of them
contain domains that are conserved in other genera. The genes encoding the
Plasmodium-specific families are primarily located near the ends of the
chromosome. A single var gene was identified in each subtelomeric region.
The var genes encode large transmembrane proteins (PfEMP1) expressed in
knobs on the surface of schizont-infected red cells. PfEMP1 proteins
exhibit extensive sequence diversity; are clonally variant; and are
involved in antigenic variation, cytoadherence, and rosetting (6-8). In
addition to the full-length var genes, six small ORFs were identified in
the subtelomeric regions that were similar to var sequences. Five of these
ORFs resembled the var exon II cDNAs or the Pf60.1 sequences that were
reported previously (7, 25).

The largest Plasmodium-specific family found on chromosome 2 encodes
proteins that were dubbed rifins, after the RIF-1 repetitive element.
RIF-1 contained a 1-kb ORF but no initiation codon, was found on most
chromosomes, and was transcribed in late blood-stage parasites (9). The
function of the RIF-1 element was unknown. Eighteen ORFs with similarities
to RIF-1 were found in the subtelomeric regions of chromosome 2,
centromeric to the var genes. An inspection of the sequence upstream of
these ORFs revealed exons encoding signal peptides, which indicated that
the RIF-1 elements were actually genes consisting of two exons. These
genes encode potential transmembrane proteins of 27 to 35 kD, with an
extracellular domain that contains conserved Cys residues that might
participate in disulfide bonding, a transmembrane segment, and a short
basic COOH-terminus. The extracellular domain also contains a highly
variable region (Fig. 3). RT-PCR with schizont RNA showed that one of six
rifin genes that were tested was transcribed. The function of the rifins
is unknown, but their sequence diversity, predicted cell surface
localization, and expression in schizont stages suggest that, like var
genes, they may be clonally variant. Multiple rifin genes were detected in
the telomeric regions of chromosomes 3 and 14, suggesting that rifin genes
have propagated as clusters in the course of Plasmodium evolution (26). If
the number found on chromosome 2 is representative of other chromosomes,
there may be 500 or more rifin genes in the P. falciparum genome (~7% of
all protein-coding genes), making it the most abundant gene family in this
organism. The presence of var and rifin genes and other ORFs in
subtelomeric regions of P. falciparum chromosomes confirms that the
subtelomeric regions are not transcriptionally silent (27).
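
The genome-wide rifin estimate can be approximated by scaling the chromosome 2 counts to the whole genome. The sketch below assumes a genome of roughly 30 Mb, a figure not stated in this part of the paper, and assumes uniform gene density across chromosomes. The authors' exact calculation is not given, so the output should be read only as an order-of-magnitude check.

```python
CHR2_LENGTH_KB = 947        # Table 1
CHR2_GENES = 209            # Table 1, predicted protein-coding regions
CHR2_RIFIN_ORFS = 18        # ORFs with similarity to RIF-1 on chromosome 2
GENOME_LENGTH_KB = 30_000   # assumption: genome of roughly 30 Mb


def scale_to_genome(count_on_chr2):
    """Scale a chromosome 2 count by the genome/chromosome length ratio."""
    return count_on_chr2 * GENOME_LENGTH_KB / CHR2_LENGTH_KB


rifins = scale_to_genome(CHR2_RIFIN_ORFS)
genes = scale_to_genome(CHR2_GENES)
print(f"estimated rifin genes: {rifins:.0f}")
print(f"estimated protein-coding genes: {genes:.0f}")
print(f"rifin share of genes: {100 * rifins / genes:.1f}%")
```

With these assumptions the scaled rifin count exceeds 500, consistent with the estimate in the text. The computed share of genes comes out somewhat above the quoted ~7%, which may reflect a different basis for the published figure, for example excluding pseudogenes.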

Another family of membrane-associated proteins, serine repeat antigens
(SERAs), contains a papain protease-like domain. A cluster of three SERA
genes, which were all transcribed in the same direction (from centromere
to telomere), was known to be on chromosome 2 (28); at least one SERA has
been evaluated for use in blood-stage vaccines. These genes are part of an
eight-gene cluster; seven genes have a similar four-exon structure, but
the gene at the 3' end of the cluster contains only three exons. The
protease domains in these proteins are unusual because five of the eight
contain serine instead of cysteine in the active nucleophile position,
suggesting that they are serine proteases with a structure that is typical
of cysteine proteases (29).

Two proteins (MSP-4 and MSP-5) that contain an epidermal growth factor
(EGF) module in their extracellular domains were identified (30, 31). In
organisms that are not classified in the animal kingdom, MSP-4, MSP-5, and
MSP-1 (a multi-EGF domain protein encoded on chromosome 3) and two
Plasmodium sexual-stage antigens (32) are the only proteins that contain
EGF repeats, which suggests that Plasmodium obtained the sequence for this
domain from its animal host. The plasmodial EGF domains may be involved in
parasite adhesion to host cells.

[Fig. 3 alignment: rifin sequences PFB1040w, PFB1015w, PFB1005w, PFB0055c,
PFB1035w, PFB1050w, PFB0015c, PFB0040c, PFB0030c, PFB1010w, PFB0035c,
PFB0060w, PFB1000w, PFB0025c, PFB0065w, PFB1020w, and PFB0955w, with a 95%
consensus line; labeled regions: signal peptide, conserved ("constant")
region, variable region, transmembrane region, intracellular region.]

Fig. 3. Multiple sequence alignment of rifins encoded on chromosome 2. The
predicted coding regions were aligned with CLUSTALW (34) using the default
settings. The alignment column shading is based on a 95% consensus, which
is shown underneath the alignment; h indicates hydrophobic residues (A, C,
F, I, L, M, V, W, and Y), p indicates polar residues (D, E, H, K, N, Q, R,
S, and T), b indicates "big" residues (F, I, L, M, V, W, Y, K, R, Q, and
E), and + indicates positively charged residues (K and R) (35). The
cysteines conserved in subsets of rifins are shown by inverse type.
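
The consensus lines described in the captions of Figs. 2 and 3 summarize each alignment column with a residue class symbol. The sketch below shows one way such a line could be derived. The class definitions are taken from the two captions, but the rule used here (report a single residue if it reaches the threshold, otherwise the smallest class that does) is an assumption and may differ from the method cited as (35).

```python
# Residue classes as defined in the Fig. 2 and Fig. 3 captions.
CLASSES = {
    "+": set("KR"),
    "o": set("ST"),
    "u": set("GAS"),
    "c": set("DEKRH"),
    "h": set("ACFILMVWY"),
    "p": set("DEHKNQRST"),
    "b": set("FILMVWYKRQE"),
}


def consensus_symbol(column, threshold=0.95):
    """Return a consensus character for one alignment column.

    column is a string with one residue per sequence ('-' for a gap).
    """
    residues = column.upper()
    needed = threshold * len(residues)

    # A single residue that reaches the threshold is reported as itself.
    for aa in sorted(set(residues) - {"-"}):
        if residues.count(aa) >= needed:
            return aa

    # Otherwise report the smallest class that reaches the threshold.
    for symbol, members in sorted(CLASSES.items(), key=lambda kv: len(kv[1])):
        if sum(r in members for r in residues) >= needed:
            return symbol
    return "."


def consensus_line(rows, threshold=0.95):
    """Build the consensus string for equal-length alignment rows."""
    return "".join(consensus_symbol("".join(col), threshold) for col in zip(*rows))


# Toy alignment (not taken from the figures).
print(consensus_line(["KLS", "RIT", "KVA"], threshold=1.0))
```

With 17 sequences, a 95% threshold requires all 17 rows to agree on a class, since 16 of 17 is about 94%.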

In addition to the families of Plasmodium-specific proteins, chromosome 2
contains genes for many secreted and membrane proteins. One of these genes
encodes a protein with a modified thrombospondin domain and was
transcribed in blood-stage parasites (17). Other Plasmodium proteins
containing thrombospondin domains, such as sporozoite surface protein
2/TRAP and circumsporozoite protein, are involved in the parasitic
invasion of host cells (33), suggesting that this protein may be involved
in the binding of infected red cells to host-cell ligands.

Determination of the first P. falciparum chromosome sequence demonstrates
that the A+T richness of P. falciparum DNA will not prevent the sequencing
of the genome. Although technical difficulties not observed during the
sequencing of other microbial genomes were encountered, solutions to these
problems were found that will facilitate sequencing of the remaining
chromosomes. The genome sequence should be of value in the study of
Plasmodium biology and in the development of new drugs and vaccines for
the treatment and prevention of malaria. In addition to these practical
benefits, the Plasmodium genome sequence should provide broader biological
insights, particularly in regard to the plasticity of the eukaryotic
genome that is manifest in the preponderance of the predicted nonglobular
domains in plasmodial proteins.

THE NEW YORK TIMES, TUE


By  NICHOLAS  WADE 

Biologists have taken a first step toward cracking the code of the
parasite that causes malaria and using the information to explore all the
deadly organism's biological weaknesses.

The first of the parasite's 14 chromosomes has been fully sequenced,
meaning the order of its DNA units has been determined, by scientists at
the Institute for Genomic Research in Rockville, Md., and other
institutions.

"This is a very important milestone for malaria research," said Dr.
Dyanne Wirth, a malaria expert at the Harvard School of Public Health.
"It provides proof of principle that we can sequence the genome and
insight into the arrangement of the parasite's genes."

One of the authors of the research, Capt. Stephen L. Hoffman of the Naval
Medical Research Center in Bethesda, Md., said that sequencing the genome
provided "the foundation for a rational strategy for vaccine and drug
development." "What we are generating here is the road map for all
malarial research in the 21st century. Without it we are in the Dark
Ages," he said. The report appears in the current issue of Science.

Disease experts hope that knowledge of the organism's complete genetic game
plan will betray weak points against which new drugs can be targeted. Malaria,
once curbed with DDT and chloroquine, is now resurgent in Africa and other
parts of the world because the parasite and the mosquitoes that spread it have
developed resistance to the usual control measures. Malaria attacks more than
300 million people every year and kills about 2 million, mostly in tropical
Africa, according to the World Health Organization.

In the last four years, new techniques of reading the vast stores of
information encoded in DNA, the genetic material, have enabled biologists to
decipher the biological programming of the smallest organisms, mostly
pathogenic bacteria whose genomes, or total DNA, range from one to five million
units of DNA.

The malaria parasite, a single-celled animal known as a protozoan, belongs to a
higher order of complexity. Its genetic endowment runs to 30 million units of
DNA and contains the genes to guide it through the extraordinary
transformations of its devious life cycle in the intestine and salivary glands
of mosquitoes and in the liver and blood cells of people.

No genome this large has yet been sequenced. But the fact that the protozoan's
DNA is packaged into 14 chromosomes, each about the size of a bacterium's
genome, encouraged the idea of tackling each one separately. Because of the
urgency of developing new defenses against malaria, the National Institutes of
Health, the Defense Department and the Wellcome Trust of London formed a
consortium to sequence the genome.

Looking for a genetic soft spot in parasite with a devious life cycle.

The Sanger Center near Cambridge, England, started to sequence chromosomes 1
and 3 while the Institute for Genomic Research, known as TIGR, began with
chromosome 2 and has finished its task first.

Dr. J. Craig Venter, until recently TIGR's president, said many experts had
predicted that the genome would be impossible to sequence because its DNA does
not remain stable when inserted into the bacteria used to copy or clone it.
"They even had me spooked," Dr. Venter said.

This problem was overcome by Dr. Hamilton O. Smith, a member of the TIGR team,
who discovered that the malarial DNA could be cloned if cut into small enough
pieces.

The chromosome, the second shortest in the malarial protozoan's genome, turns
out to have 947,103 units of DNA, coding for 210 genes, according to the
researchers' report.

Biologists can tell from the nature of the proteins specified by these genes
which ones are likely to protrude from the malarial cell, offering possible
targets for vaccines, and which ones control metabolic pathways that could be
disrupted by drugs.

"Can you imagine having all the pathways laid out?" said Dr. Louis H. Miller, a
malarial expert at the National Institutes of Health. "You can do experiments
in a day or two that might have taken months."

But a different set of genes is probably switched on at each of the four stages
in the protozoan's life cycle. Captain Hoffman said plans were under way to
identify the genes that are active at each stage and design countermeasures
accordingly.

One of the most unusual features of the malaria protozoan lies in its
organelles, units that possess their own DNA and perform special housekeeping
duties. Both animal and plant cells possess organelles thought to be ancient
bacteria that were long ago enslaved to the cell's needs. Animal cells have
mitochondria, which generate energy, and plant cells have plastids that perform
photosynthesis.

The malaria protozoan, strangely enough, has both types of organelles. It
possesses mitochondria and apicoplasts, organelles that resemble plastids that
have lost their equipment for photosynthesis.

Dr. Malcolm J. Gardner of TIGR, a co-author of the new report, noted that the
apicoplast's working parts offered particularly good targets for drugs, as they
were essentially plantlike and should have no counterparts in the human body.
Malaria is caused by apicomplexan parasites of the genus Plasmodium. It is a
major public health problem in many parts of the world. In 1994 the World
Health Organization estimated that there were 300-500 million cases and up to
2.7 million deaths caused by malaria each year, and because of increased
parasite resistance to chloroquine and other antimalarials the situation is
expected to worsen considerably (WHO 1997). These dire facts have stimulated
efforts to develop an international, coordinated strategy for malaria research
and control (Butler et al. 1997). Development of new drugs and vaccines against
malaria will undoubtedly be an important factor in control of the disease.
However, despite recent progress, drug and vaccine development has been a slow
and difficult process, hampered by the complex life cycle of the parasite, a
limited number of drug and vaccine targets, and our incomplete understanding of
parasite biology and host-parasite interactions.

The advent of microbial genomics, i.e. the ability to sequence and study the
entire genomes of microbes, should accelerate the process of drug and vaccine
development for microbial pathogens. As pointed out by Bloom, the complete
genome sequence provides the "sequence of every virulence determinant, every
protein antigen, and every drug target" in an organism (Bloom 1995), and
establishes an excellent starting point for this process. Today, the complete
genome sequences of 13 microbes have been published, including several human
pathogens, and many more microbial genomes are in the works (a listing of
microbial genome projects completed or underway can be found at
www.tigr.org/mdb).

Two main strategies have been used in these projects. One approach, pioneered
at TIGR (Fleischmann et al. 1995), is the whole genome shotgun method, in which
a genomic library of sheared 1-2 kb fragments is prepared in a plasmid vector,
and clones are picked at random and sequenced. Special software is then used to
assemble the overlapping fragments into a contiguous sequence. The whole genome
method is dependent on high-quality shotgun libraries and robust software for
fragment assembly. The second method, used to sequence the E. coli genome, for
example, involves sequencing of large-insert clones from cosmid or lambda
libraries (Blattner et al. 1997). Although not so dependent upon computational
resources as the whole genome shotgun method, sequencing of large-insert clones
does require a physical map of the genome to guide selection of the clones to
be sequenced.
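The assembly step of the shotgun method can be illustrated with a toy example. The sketch below is not the software used in the projects described here; it is a minimal greedy overlap-merge, assuming error-free reads that all come from the same strand. The function names, the 25-base "genome" and the four reads are hypothetical and chosen only for illustration. Real assembly software must also cope with sequencing errors, repeats and reads from both strands.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of fragments with the longest overlap.

    Returns the remaining contigs; more than one contig means gaps are left.
    """
    # Drop duplicates and fragments wholly contained in another fragment.
    contigs = list(dict.fromkeys(fragments))
    contigs = [f for f in contigs if not any(f != g and f in g for g in contigs)]

    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical 25-base "genome" and four overlapping reads, for illustration only.
GENOME = "ATGGCGTACGTTAGCCTAGGATCCA"
READS = [GENOME[0:10], GENOME[6:16], GENOME[12:22], GENOME[18:25]]

if __name__ == "__main__":
    print(greedy_assemble(READS))
```

When the reads overlap end to end, a single contig is returned; reads with no overlap stay as separate contigs, which corresponds to the gaps that must later be closed.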

At first, it was unclear how best to proceed in sequencing the genome of P.
falciparum, the human malaria parasite responsible for the most morbidity and
mortality. The P. falciparum genome is about 30 Mb in size, about 8- to 10-fold
larger than a typical eubacterial genome, and its size was thought to preclude
the whole-genome approach due to the computational limitations inherent in the
assembly process, and difficulties in closing gaps that usually persist after
assembly. The large-insert library approach was ruled out by the fact that P.
falciparum has an overall base composition of approximately 82% AT. This
unusual base composition is thought responsible for the fact that P. falciparum
DNA is notoriously unstable in E. coli, such that representative large-insert
(> 20 kb) genomic libraries in plasmid, lambda, and cosmid vectors that could
be used for sequencing cannot be prepared. Yeast artificial chromosome (YAC)
libraries of P. falciparum (Foster and Thompson 1995) have been constructed,
however, and while these appear to stably maintain large inserts, YACs are not
very well suited to high-throughput sequencing projects.
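Base composition figures such as the approximately 82% AT quoted above are simple counts over a sequence. The sketch below shows one way such a figure could be computed; the function names and the example sequences are hypothetical, and ambiguous bases (anything other than A, C, G or T) are assumed to be ignored.

```python
def at_fraction(sequence: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T bases")
    return sum(b in "AT" for b in bases) / len(bases)


def windowed_at(sequence: str, window: int) -> list:
    """AT fraction in consecutive, non-overlapping windows of the given size."""
    return [
        at_fraction(sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, window)
    ]


if __name__ == "__main__":
    # Hypothetical 10-base sequence: eight of its ten bases are A or T.
    print(at_fraction("ATATATATGC"))
    # Windowed view of a made-up sequence with an AT-rich half and a GC-rich half.
    print(windowed_at("AAAAACCCCC", 5))
```

A windowed calculation of this kind is one way to see how composition varies along a chromosome rather than only as a genome-wide average.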


Abbreviations: EBI: European Bioinformatics Institute; EST: expressed sequence
tag; DoD: US Department of Defense; GST: genomic sequence tag; NCBI: National
Center for Biotechnology Information; NMRI: Naval Medical Research Institute;
PFGE: pulsed field gel electrophoresis; TIGR: The Institute for Genomic
Research; TDR: Special Programme for Research and Training in Tropical
Diseases; YAC: yeast artificial chromosome.
These problems led to development of a third approach to genome sequencing,
namely shotgun sequencing of individual chromosomes purified by pulsed field
gel electrophoresis (PFGE). P. falciparum has 14 chromosomes ranging from 0.8
to 3.4 Mb in length. Most of the chromosomes of P. falciparum clone 3D7 (the
clone selected for sequencing) can be resolved in PFGE gels, except for
chromosomes 5-9 which co-migrate as a "blob" in the middle of the gel.
Chromosomes are resolved on preparative PFGE gels and chromosomal DNA is
extracted by agarase digestion. The chromosomal DNA is then sheared into 1-2 kb
fragments, cloned into plasmid or M13 vectors, and randomly-picked clones are
sequenced. The sequences are assembled to form contigs, and techniques such as
PCR from genomic DNA with primers derived from the ends of contigs are used to
close gaps in the sequence. Some laboratories also perform limited sequencing
of shotgun libraries prepared from YACs previously localized on the chromosomes
(Foster and Thompson 1995). The YAC-derived sequences help to group contigs
from the same part of the chromosome and assist in gap closure.
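Why gaps usually persist after random sequencing can be seen from an idealised model of shotgun coverage (the Lander-Waterman approximation), in which reads fall uniformly at random on the target. The sketch below is only an estimate under that assumption; the 500-base read length and 8-fold coverage are hypothetical values chosen for illustration, not figures from the project, and the 1.0 Mb target corresponds to the size listed for chromosome 2.

```python
import math


def reads_needed(target_len_bp: int, read_len_bp: int, coverage: float) -> int:
    """Number of random reads giving the requested average coverage."""
    return math.ceil(coverage * target_len_bp / read_len_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read: 1 - e^(-c)."""
    return 1.0 - math.exp(-coverage)


def expected_gaps(n_reads: int, coverage: float) -> float:
    """Approximate expected number of gaps between contigs: N * e^(-c)."""
    return n_reads * math.exp(-coverage)


if __name__ == "__main__":
    chromosome_bp = 1_000_000  # about 1.0 Mb, the size given for chromosome 2
    read_bp = 500              # assumed average read length (hypothetical)
    coverage = 8.0             # assumed average coverage (hypothetical)

    n = reads_needed(chromosome_bp, read_bp, coverage)
    print("reads:", n)
    print("fraction covered:", expected_fraction_covered(coverage))
    print("expected gaps:", expected_gaps(n, coverage))
```

Even at high average coverage the model predicts a handful of gaps, which is consistent with the need for a directed closure step such as PCR across contig ends.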

Three groups are sequencing the P. falciparum genome: TIGR and the Malaria
Program of the US Naval Medical Research Institute (NMRI); the Sanger Centre in
the UK; and Stanford University. An international consortium including the
genome laboratories, bioinformatics centers, and funding agencies was formed to
oversee the project, facilitate collaboration, and ensure that the data will be
provided to the scientific community in a timely and useful manner (Hoffman et
al. 1997). Members of the consortium meet every 6 months to review progress and
plan future work. The current status of the project is summarized in Table 1.
The strategy of sequencing on a chromosome-by-chromosome basis led naturally to
assignment of individual chromosomes to the different genome centers, with the
"blob" of currently unresolved chromosomes being undertaken rather heroically
by the Sanger Centre. Progress in the first pilot projects, namely chromosome 2
by TIGR/NMRI and chromosome 3 by the Sanger Centre, has, after initial
technical difficulties, been good, such that both chromosomes are expected to
be completed shortly, and the Stanford group has begun work on chromosome 12.
Preliminary, unedited data have been released into the public domain and are
available for downloading, browsing or searching on web sites maintained at
each laboratory (Table 2), the National Center for Biotechnology Information
(NCBI), and the European Bioinformatics Institute (EBI). The Sanger Centre and
TIGR have started work on the other chromosomes.

Thus despite initial scepticism in the malaria research community that the
AT-rich P. falciparum genome could be sequenced, the success achieved on
chromosomes 2 and 3 proves that it is technically feasible, and malaria
researchers should soon have access to the complete genome sequence. Recent
technological advances such as stable transfection of Plasmodium spp., and
microarray technologies for global measurement of gene expression, in
combination with the genome sequence, will facilitate research to understand
Plasmodium biology. In addition, sequencing efforts planned or underway for
other Plasmodium species and other Apicomplexa such as Toxoplasma (Table 2)
will provide useful complementary data. Although


Table 1. Chromosome assignments and sequencing status for the Malaria Genome Sequencing Project.

Chromosome(s) (a)  Size (Mb)  Laboratory           Funding (b)      Status (as of 3/98)
1                  0.8        Sanger Centre        Wellcome Trust   random sequencing
2                  1.0        TIGR/NMRI            NIAID, DoD       annotation
3                  1.2        Sanger Centre        Wellcome Trust   closure
4                  1.4        Sanger Centre        Wellcome Trust   random sequencing
5-9                1.6-1.8    Sanger Centre        Wellcome Trust   library preparation
10                 2.1        TIGR/NMRI            NIAID, DoD       library preparation
11                 2.3        TIGR/NMRI            NIAID, DoD       library preparation
12                 2.4        Stanford University  BWF              random sequencing
13                 3.2        Sanger Centre        Wellcome Trust   library preparation
14                 3.4        TIGR/NMRI            BWF, DoD         random sequencing

(a) Estimated sizes for P. falciparum clone 3D7 taken from Dame et al. (1996).
(b) NIAID, National Institute for Allergy and Infectious Diseases; DoD, US Department of Defense; BWF, Burroughs
Wellcome Fund.
Table 2. Internet resources related to the Malaria Genome Sequencing Project.

Web Site: P. falciparum chromosome 2, TIGR
Content:  Preliminary sequence data for chromosome 2.
URL:      http://www.tigr.org/tdb/mdb/pfdb/pfdb.html

Web Site: P. falciparum chromosomes 1, 3, 4, The Sanger Centre
Content:  Preliminary sequence data for chromosomes 1, 3, and 4.
URL:      http://www.sanger.ac.uk/Projects/P_falciparum/

Web Site: P. falciparum chromosome 12, Stanford University
Content:  Preliminary sequence data for chromosome 12.
URL:      http://sequence-www.stanford.edu/group/malaria/index.html

Web Site: P. falciparum Gene Sequence Tag Project, University of Florida
Content:  A collection of ESTs and GSTs for P. falciparum.
URL:      http://parasite.arf.ufl.edu/malaria.html

Web Site: Malaria Database, Monash Univ., Walter and Eliza Hall Institute
Content:  A collection of genetic information on malaria parasites. Sponsored by WHO/TDR.
URL:      http://www.wehi.edu.au/MalDB-www/who.html

Web Site: Malaria Genetics and Genomics, National Center for Biotechnology Information (NCBI)
Content:  BLAST searches on Apicomplexan sequence data, including P. falciparum; P. falciparum linkage maps, etc.
URL:      http://www.ncbi.nlm.nih.gov/Malaria/

Web Site: Parasite Genomes Blast Server, European Bioinformatics Institute
Content:  BLAST searches on sequence data from many parasites, including Plasmodium.
URL:      http://www.embl-ebi.ac.uk/parasites/parasite_blast_server.html

Web Site: Malaria Foundation
Content:  General information on malaria and many links to malaria-related sites.
URL:      http://www.malaria.org/index.htm

Web Site: Toxoplasma Database, University of Pennsylvania
Content:  Toxoplasma ESTs clustered with ESTs from dbEST.
URL:      http://daphne.humgen.upenn.edu:1024/toxodb/ver_1/

Web Site: TIGR Microbial Database
Content:  A comprehensive listing of microbial genome projects.
URL:      http://www.tigr.org/tdb/mdb/mdb.html

it is a long way from laboratory research to the fielding of new drugs or
vaccines, with the advent of microbial genomics we can expect the process to be
speeded up considerably.
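The division of labour in Table 1 can be tallied with a few lines of code. The sketch below transcribes the chromosome sizes and laboratories from Table 1; the data layout and function names are hypothetical, and it assumes that each of the five co-migrating chromosomes 5-9 falls within the 1.6-1.8 Mb range given for the group, so totals are reported as a (minimum, maximum) pair.

```python
from collections import defaultdict

# (chromosome label, number of chromosomes, min size Mb, max size Mb, laboratory)
TABLE_1 = [
    ("1", 1, 0.8, 0.8, "Sanger Centre"),
    ("2", 1, 1.0, 1.0, "TIGR/NMRI"),
    ("3", 1, 1.2, 1.2, "Sanger Centre"),
    ("4", 1, 1.4, 1.4, "Sanger Centre"),
    ("5-9", 5, 1.6, 1.8, "Sanger Centre"),  # assumption: each of the five is 1.6-1.8 Mb
    ("10", 1, 2.1, 2.1, "TIGR/NMRI"),
    ("11", 1, 2.3, 2.3, "TIGR/NMRI"),
    ("12", 1, 2.4, 2.4, "Stanford University"),
    ("13", 1, 3.2, 3.2, "Sanger Centre"),
    ("14", 1, 3.4, 3.4, "TIGR/NMRI"),
]


def totals_by_lab(rows):
    """Return {laboratory: (chromosome count, min total Mb, max total Mb)}."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for _label, count, low, high, lab in rows:
        entry = totals[lab]
        entry[0] += count
        entry[1] += count * low
        entry[2] += count * high
    return {lab: tuple(values) for lab, values in totals.items()}


def genome_total(rows):
    """Return the (min, max) summed size in Mb over all listed chromosomes."""
    low = sum(count * lo for _label, count, lo, _hi, _lab in rows)
    high = sum(count * hi for _label, count, _lo, hi, _lab in rows)
    return low, high


if __name__ == "__main__":
    for lab, (count, low, high) in totals_by_lab(TABLE_1).items():
        print(f"{lab}: {count} chromosomes, {low:.1f}-{high:.1f} Mb")
    print("all chromosomes: %.1f-%.1f Mb" % genome_total(TABLE_1))
```

Under that assumption the listed estimates sum to somewhat less than the "about 30 Mb" quoted for the whole genome, which is unsurprising given that the individual chromosome sizes are themselves gel-based estimates.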
An international consortium of genome centres, advanced development teams
and  funding  agencies  has  begun  the  task  of  sequencing  the  genome  of  the 
parasite  Plasmodium  falciparum,  the  most  important  cause  of  human  malaria. 
Sequencing  is  proceeding  chromosome  by  chromosome,  and  the  annotated 
sequence  of  chromosome  2  is  nearly  finished.  With  the  continual  release  of 
sequence  data  as  they  are  generated,  malaria  researchers  have  access  to  a 
steady  stream  of  genomic  sequences  and  will  soon  have  the  complete 
annotation  of  all  of  the  estimated  5000-7000  P.  falciparum  genes.  The  task  will 
then  be  how  to  best  apply  these  data  to  the  development  of  new  anti-malarial 
drugs,  vaccines  and  diagnostic  tests.  This  review  provides  a  brief  overview  of 
the  Malaria  Genome  Sequencing  Project  and  suggests  potential  directions  for 
future  malaria  research. 


molecular 

medicine 


In 1977, Fred Sanger and his colleagues heralded in the field known as
genomics, having shown that it was possible to determine the entire genetic
sequence of the virus phi-X (φX174) (Ref. 1). In that same year the completed
genome of another virus, simian virus 40 (SV40), was also reported (Refs 2, 3).
Sequencing progressed rapidly as genomes an order of magnitude larger, such as
the bacteriophages T7 and lambda, were completed (Refs 4, 5). Within 15 years
of the first viral genome being completed, the genomes of plant chloroplasts
and the Epstein-Barr virus (EBV) were finished (Refs 6, 7), and many individual
gene sequences were deposited into the public domain (Ref. 8). These first
projects, though tedious to complete and requiring a great deal of manual
effort, first suggested that 'unlocking the key to the genome' was possible;
they also showed that it was technically feasible to determine the entire
genome sequence of an organism and thus gain access to the description of its
fundamental biology. Automated sequencing apparatus and
[...] produce and analyse large quantities of DNA rapidly and efficiently (Refs
10, 11, 12, 13). Stimulated by the success of previous sequencing projects and
the tantalising potential benefits of large-scale sequencing, scientists in the
mid-1980s considered the enormous task of determining the sequence of the
entire human genome. At 3 billion (3 x 10^9) base pairs (bp), the Human Genome
Project represented the largest genome project ever undertaken. Although most
of the initial focus of the Human Genome Project has been the production of
genome 'maps', attention is now turning towards sequencing (Ref. 14). Indeed,
[...] human genome maps were completed, large-scale efforts were directed [...]

Efforts have not been solely centred on the Human Genome Project; the
determination of the genetic sequences of microbial organisms, especially those
of human pathogens, has the potential to revolutionise the development of new
drugs and vaccines. A milestone in genomics was reached in 1995 when the first
complete bacterial genome was reported (Ref. 15). The genome of the free-living
bacterium Haemophilus influenzae, at 1.8 million bp, was the largest completed
genome, and the first example of a 'whole-genome sequencing strategy' being
applied to a microbial organism. On the heels of this news, Barry Bloom in a
leader article in Nature considered that 'the power and cost-effectiveness of
modern genome sequencing technology mean that the complete genome sequences of
25 of the major bacterial and parasitic pathogens could be available within
five years' (Ref. 16). At that time, he thought that 'for about 100 million
dollars, we could buy the sequence of every virulence determinant, every
protein antigen and every drug target' (Ref. 16). He predicted that this
up-front investment in genome sequencing would produce information [that would]
be available to scientists forever, and that 'we could then think about a new,
post-genomic era of microbe biology' (Ref. 16).

Genome sequencing is progressing at an extraordinary rate; there are now 13
published [...] in genome sequencing that have [...] the 5386 base pairs of
phi-X (φX174) were first published. Today, the entire genome [of the first]
sequenced virus could be completed by a handful of people in a genome centre in
an afternoon.

Sequencing strategies

The approach taken towards the sequencing of the genome of an organism depends
on a variety of factors [...] reagents, the size of the genome of the
[organism] and other characteristics of the [...] size of a given genome
project increases, methods are employed to partition the entire genome into
smaller subunits. This usually means constructing [...] such as a cosmid.
Cosmids can accept inserts as large as 40 kilobase pairs (kb) and are arranged
in order along the genome, creating a physical map of the genome. Once the
minimum number of overlapping cosmids has been determined, sequencing begins on
each identified cosmid. By sequencing cosmid-sized fragments of the genome,
data handling becomes more manageable and assembly of the final sequence is
facilitated. However, as the genome size increases beyond several megabase
pairs (Mb) (consider, for example, the human genome, which comprises 3 x 10^9
bp), an additional, upper layer of organisation is needed. The reason for this
becomes clear when one considers subcloning the entire human genome using
cosmids; a ten-fold representation of the genome (every region in the [...] DNA
library should be present at least ten-fold) would require 750,000 cosmid
clones. By today's standards, this would be an unmanageable number of clones to
characterise and arrange in order along the genome. For such large organisms, a
large-insert library must be created, [...] bacterial [artificial chromosomes
(BACs)] or yeast artificial chromosomes (YACs, Ref. 18 and reviewed in Ref.
19), which can accept DNA fragments that were several hundred kb in size. A
low-resolution physical map (with relatively few clones widely separated) is
created using BACs or YACs: these are then subcloned into cosmids to produce a
high-resolution map. The
[...] different position) or chimeric (two previously separate sections are
brought together) and, thus, are of no use for sequencing. Furthermore, DNA
[from] YACs is often contaminated with yeast [...] include random-shotgun
sequencing, using small (1-2 kb) fragments of sheared DNA, of complete genomes
and, for larger genomes, a BAC-end sequencing strategy (Ref. 20). In BAC-end
sequencing the minimum number of overlapping clones necessary to cover a region
of the genome is identified and sequenced. Only the end of each clone is
sequenced; the information is collected and the gaps are then filled by further
full-length sequencing. As sequencing technology and computer assembly
algorithms improve, it will become possible to sequence larger genomes by the
'shotgun method'. These sequence-based methods have the potential to expedite
genome sequencing and further reduce the overall costs involved.

Sequencing microbial genomes

For smaller genomes, such as those of bacteria that range in size from 600 kb
to 5 Mb, shotgun sequencing of the whole genome is becoming routine. The first
bacterial genome, H. influenzae, was completed almost entirely by 'shotgun
sequencing' (Ref. 15); that is, the entire genome (1.8 Mb) was sheared into
small (1-3 kb) fragments and randomly cloned into the sequencing plasmid. The
development of a coordinated sequencing effort, an integrated database
management system, and improved sequence-assembly software meant that the
entire genome [...] could be completed with ~24,000 successful sequencing
reactions, all within approximately one year. Even as the H. influenzae
sequence publication 'went to press', sequencing of two additional microbial
genomes was already nearing completion (Refs 21, 22). Of the 13 microbial
genomes that have been sequenced to

[...] sequencing one or more [...] the parasites responsible [for] malaria. The
result of the meeting [was the] establishment of an international [...] (or
nearly all of) the deaths due to [malaria in] humans (Ref. [...]). A [...] [The
Institute] for Genomic Research (TIGR, Rockville, MD, USA) and the Naval
Medical [Research Institute] (NMRI, Rockville, MD, USA) [...] NIH and the US
Department of [Defense ...] develop sequencing strategies for the Malaria
Genome Sequencing Project. This has resulted in the complete 1-Mb sequence of
[...] chromosome 2 (manuscript in p[...]). Sequencing of the [...] proceeding
at three genome centres: TIGR/NMRI, the Sanger Centre (Hinxton, UK) and
Stanford University (Stanford, CA, USA), with funding from the Burroughs
Wellcome Fund, the NIH, the Wellcome Trust, and the DoD. Early efforts have met
with success and thus the consortium is pushing forward [with] the intent of
completing the entire genome of P. falciparum by 2002-2003.

Clinical implications/applications
Why sequence the malaria genome?

The world malaria situation is worsening. The World Health Organization (WHO)
estimates that one quarter of the population of the world lives in malarious
areas and that 300-500 million cases of malaria occur annually (Ref. 24).
Although more than 2.6 million people die every year from this disease, few
people in the [...] realise the enormous economic, political and social burden
that malaria places on those living with this disease (Ref. 24). In addition,
[...] numbers of people will be exposed as the effects of global warming are
manifested [...] the mosquito vector of [...] the non-malarious world (Refs 25,
26). With increasing air travel in a shrinking world, in the
future, more people that previously were not generally exposed to malaria will
be placed at [risk] from this disease. Unfortunately, drug resistance in P.
falciparum to chloroquine, one of the best anti-malarial drugs ever developed,
is widespread and is found in most of the malarious world. Other species of
Plasmodium, particularly P. vivax, are also beginning to develop patterns of
chloroquine resistance (Ref. 27). Success in developing new anti-malarial drugs
has been short-lived because Plasmodium parasites continue to develop
resistance to broad classes of anti-[malarial drugs]; in fact, most of the
drugs used for anti-malarial prevention and therapy such as mefloquine are
[...] strategies are [urgently] needed to combat the menacing problem of
malaria.

The difficulty of the situation that faces malaria [...] millennia in a hostile
immune environment; the parasite possesses a complex multistage life cycle
(Fig. 1; fig001dcn) in both a vertebrate (such as human) and an invertebrate
(such as an Anopheles sp. mosquito) host. It exists (1) free in the circulatory
system; (2) inside liver cells (hepatocytes) which are capable of presenting
[...] in association with major histocompatibility complex (MHC) molecules; and
(3) inside red blood cells, which in humans do not have an MHC-restricted
antigen presentation pathway. The parasite is exposed to both humoral (soluble)
and cellular immune mechanisms. It has also developed complex drug-resistance
mechanisms, which span a broad range [...] of new anti-malarial drugs and
vaccines must, therefore, consider both immune evasion (avoidance of the immune
system) and drug-resistance mechanisms. Any [...] approach must also be
directed against multiple stages of the parasite, altogether a momentous
undertaking. In many ways, malaria researchers are woefully under-equipped to
deal [...] parasite. Because in vitro cultivation of most malaria parasites is
routinely

[...] respective antigens [is] limited. Animal models do exist for malaria;
however, none reproduces accurately the pathology that is seen in humans.
Although the transfection of genes into malaria parasites has been developed
recently, it is being used in only a few laboratories and is not yet routine.
Finally, of the estimated 5000-7000 genes in Plasmodium spp., only a few
hundred are [...] represent little more than a brief [...] used by the
parasite. Clearly, more information is needed to develop novel anti-malarial
strategies. The advances in [...] tools to assist in the discovery of novel
targets [...] development of malaria vaccines and anti-malaria drugs. It should
yield targets for improved diagnostic tests and [...] in humans.

The Plasmodium genome

The genome of Plasmodium spp. is ~30 Mb, distributed [...] in size from [...]
kb to [...] Mb each ([...] tab001dcn). Figure 2 (fig002dcn) shows a comparison
of the sizes of some other genomes. Plasmodium falciparum and several other
Plasmodium spp. are unusual in that their genomes have an extraordinary bias
towards [the] nucleotides adenine (A) and thymine (T). [In] regions that code
for proteins, the A-T bias [...]6%, whereas [...] (regions between genes) and
[...] within genes that are remo[ved ...] transcription), the A-T content can
approach 100% (Refs 29, 30). This extreme A-T bias [...] be responsible for the
observ[ed ...] cloning and maintaining large [...] (greater than several kb) of
P. falciparum DNA in Escherichia coli (Ref. 31). This instability has [been]
problematic because there are, as yet, no [...] libraries available that can
accept large [...] P. falciparum DNA. The development of YACs (Ref. 16) has
been applied with success to
Figure 1. The life cycle of the Plasmodium falciparum malaria parasite. Malaria is caused
by infection with an obligate, intracellular protozoan parasite of the genus Plasmodium.
Of the four species that infect humans (Plasmodium falciparum, Plasmodium vivax,
Plasmodium ovale and Plasmodium malariae), it is P. falciparum that is responsible for
virtually all deaths. The life cycle of Plasmodium spp. is complex and somewhat specific
to the parasite species. (a) P. falciparum infection in humans begins when an infected
Anopheles sp. mosquito takes a blood meal and injects infective sporozoites into the
peripheral circulation. (b) Within minutes, these sporozoites invade hepatocytes in the
liver and, over approximately one week, undergo asexual multiplication, producing tens of
thousands of merozoite forms of the parasite. (c) When the infected hepatocyte ruptures,
merozoites are released into the peripheral circulation. (d) The merozoites invade red
blood cells (rbcs) and (e) complete another round of multiplication within 48-72 h, with
the production of 16-20 additional merozoites per rbc, which devour the rbc haemoglobin
in the process. (f) The released merozoites invade additional rbcs and carry on the
cycle. It is the synchronous release of merozoites that is thought to be responsible for
the periodic fevers associated with malaria. (g) Some invading merozoites do not divide,
but differentiate into male (microgametocyte) and female (macrogametocyte) sexual forms.
(h) These sexual forms are taken from the bloodstream by a feeding Anopheles sp. mosquito
and (i) fertilise in the mosquito midgut to form zygotes. These zygotes further
differentiate into motile forms, called ookinetes, migrate through the mosquito gut wall
and divide within oocysts on the external gut wall to form thousands of sporozoites.
(j) The infective sporozoites are released into the mosquito haemocoele and move to the
salivary gland, where they await injection into another human host, thus completing the
life cycle (fig001dcn).

P. falciparum (Refs 32, 33) and most recently to P. vivax (Ref. 34), presumably owing to
the similar nucleotide composition of Plasmodium spp. and the yeast Saccharomyces
cerevisiae. This strategy is being used extensively by malaria researchers (Ref. 35).

Sequencing the P. falciparum genome

In designing sequencing strategies for the P. falciparum genome project, the consortium
focused first on several technical hurdles. The first one was the concern that the high
A-T bias and observed inability to produce representative large-insert genomic libraries
in E. coli would
Table 1. Sequencing of Plasmodium falciparum chromosomes (tab001dcn)

Chromosome   Size (in Mb)   Sequence centre   Status/comment
1            0.65           Sanger            Partially sequenced
2            1.0            TIGR/NMRI         Sequence completed
3            1.2            Sanger            Nearing completion of sequence
4            1.4            Sanger            Partially sequenced
5            1.6            Sanger
6            1.6            Sanger
7            1.7            Sanger
8            1.7            Sanger
9            1.8            Sanger
10           2.1            TIGR/NMRI
11           2.3            TIGR/NMRI
12           2.4            Stanford
13           3.2            Sanger
14           3.4            TIGR/NMRI         Partially sequenced

Abbreviations used and links to further sequence information: Mb = megabase pairs;
Sanger = Sanger Centre (Hinxton, UK), http://www.sanger.ac.uk/Projects/P_falciparum/;
TIGR = The Institute for Genomic Research (Rockville, MD, USA),
http://www.tigr.org/tdb/mdb/pfdb/pfdb.html; NMRI = Naval Medical Research Institute
(Rockville, MD, USA), http://www.nmri.nnmc.navy.mil; Stanford = Stanford University
(Stanford, CA, USA), http://sequence-www.stanford.edu/group/malaria/index.html

exclude the possibility of using large-insert [...] libraries to sequence the whole
genome using the BAC-end sequencing approach (Ref. 20). The second hurdle was that
although YAC libraries of P. falciparum were available (Refs 32, 33), YACs were not
considered to be good substrates for sequencing; also, bacterial subclone libraries
derived from YACs were notorious for their contamination with yeast chromosomal DNA.
Finally, the large size of the malaria parasite genome and the intention to divide the
sequencing efforts among several laboratories meant that a whole-genome shotgun approach
[...] not be easily partitioned.

Sequencing strategies for P. falciparum

The consortium agreed on an approach based on the fact that most of the 14 Plasmodium
chromosomes can be separated by pulsed-field gradient gel electrophoresis (PFGGE). In
fact, using this commonly used molecular biology technique (Ref. 36), over 80% of the
genome from P. falciparum can be separated as individual chromosomes; the remaining 20%
that consists of five co-migrating chromosomes cannot be separated in this way. A
decision was made, therefore, to approach the sequencing of the malaria genome one
chromosome at a time. The plan was to separate those P. falciparum chromosomes that do
not co-migrate by PFGGE and treat each one as if it were an individual 1-3-Mb microbial
sequencing [...]. Mapping data that were already available from the ongoing
P. falciparum Genome Project sponsored by the Wellcome Trust [...] provide important
information for the process of closing the gaps. Individual chromosomes [...] were to be
sequenced were assigned to three genome centres: TIGR (in conjunction with the NMRI),
the Sanger Centre and Stanford University (Table 1/tab001dcn). Random shotgun libraries
that were specific for [...] libraries were sequenced. In addition, some sequencing
centres have used some YAC-based sequencing. At the time of writing, sequencing of the
1-Mb chromosome 2 (at TIGR/NMRI) and the 1.2-Mb chromosome 3 (at the Sanger Centre) is
nearly complete, and significant progress has been made on chromosome 12 (at Stanford
University) and on several other chromosomes.

Research in progress and outstanding research questions

The successful sequencing of the first two of the 14 chromosomes of P. falciparum has
proven the feasibility of completing the entire genome of this parasite. Malaria
researchers and genome centres



The  Malaria  Genome  Sequencing  Project 



Sequencing  the  genome  of  Plasmodium  falciparum 

Daniel J. Carucci, Malcolm J. Gardner, Herve Tettelin, Leda M. Cummings,
Hamilton O. Smith, Mark D. Adams, J. Craig Venter and
Stephen L. Hoffman


Advances in microbial genomic sequencing have the
potential to revolutionize the control of infectious diseases.
Recently, a consortium of researchers and funding agencies
from the United States and Great Britain has embarked on
a project to sequence the genome from Plasmodium
falciparum, the most important cause of human malaria. The
Malaria Genome Sequencing Project has reached an
important milestone with the completion of the entire DNA
sequence and annotation of chromosome 2, a 950 kilobase
chromosome of Plasmodium falciparum. This review article
will provide an overview of the malaria genome sequencing
project, highlight progress in the field of microbial
sequencing, and suggest new directions for future malaria
research. Curr Opin Infect Dis 11:000-000. © 1998 [...]
Introduction

Malaria in humans is caused by infection with one of four
species of the apicomplexan parasite of the genus
Plasmodium (Plasmodium falciparum, Plasmodium vivax,
Plasmodium malariae, and Plasmodium ovale) through the
bite of a female Anopheles species mosquito. Although it is
infection with P. falciparum that is the primary cause of
human mortality associated with malaria, the other
Plasmodium species also contribute to the countless episodes of
illness in those infected. The WHO has estimated that
there are 300-500 million cases of malaria annually and that
1.5-2.7 million people die as a result of malaria each year.
The world malaria situation is worsening because of the
spread of antimalarial drug resistance, and because of
resistance to insecticides by the mosquito vector. In
addition, multi-drug-resistant parasites have meant that
newly released antimalarial compounds have been routinely
short-lived as the malaria parasite has evolved resistance
against these new compounds. Indeed, although the
concern of drug-resistant malaria has been focused primarily
on P. falciparum, drug resistance has also extended to
P. vivax; chloroquine-resistant P. vivax is now found in
Oceania, southeast Asia, and in parts of South America. A
poor understanding of drug resistance mechanisms and a
dearth of biochemical drug targets have hindered the
development of effective long-lived antimalarial drugs.

To date there is no licensed vaccine against malaria.
Evidence from immunological and epidemiological data in
malaria-endemic areas and from laboratory data in animal
and human models, however, suggests that the development
of an effective antimalarial vaccine is feasible [1**]. The
immunization of mice and humans with radiation-attenuated
sporozoites provides 100% protection against malaria
[2]. Furthermore, vaccines directed against antigens
expressed during the blood stage of the parasite confer
protection in mice and non-human primates. The development
of novel vaccine delivery systems, particularly DNA
vaccines, has the potential to revolutionize vaccinology.
DNA vaccines provide a tool for rapidly screening vaccine
candidates by circumventing the need to produce recombinant
protein immunogens. The development of a successful
malaria vaccine will probably require targeting of the
numerous antigens expressed at various stages of the
complex life cycle of the parasite, and will need to be able
to generate the protective immunological responses directed
against that antigen. Although the immunological
responses against several malaria antigens have been well
studied, most malaria researchers would agree that a major
stumbling block to the development of effective antimalarial
vaccines is the paucity of well-characterized vaccine
targets against particular parasite stages.

The need for new directions

Clearly, the successful development of antimalarial drugs
and vaccines will require a more complete understanding of
the complex life cycle of the parasite, and the mechanisms
underlying drug resistance and the avoidance of protective
host immune responses. A detailed study of the genetic
code of the parasite may provide the necessary tools for the
identification of crucial biochemical targets for the
development of antimalarial drugs and reveal targets of protective
immunity that might otherwise be undiscovered. Indeed,
advances in microbial genomic sequencing have progressed
rapidly over the past few years, to a point where it is now
possible to determine the entire genetic sequence of an
organism. Once deciphered, the entire genetic code of a
microbial organism provides researchers with additional
tools for the development of novel control and treatment
regimens and critical information regarding the
biochemistry of the organism. As only a few hundred of the
estimated 6000 genes in P. falciparum have been identified,
and only a handful have been identified as potential targets
of vaccines and drugs, the completion of the malaria
genome will provide the sequence of every potential drug
and vaccine target and will lay the foundation for malaria
research into the next century.

Introduction to genomic sequencing

The first completed genetic sequences were those of small
viruses [3]. These genomes, which were of the order of
several thousand base pairs, were deciphered using the then
recently developed methods of dideoxy nucleic acid
sequencing and polyacrylamide gel electrophoresis. Progressively
larger genomes were completed over the next 20 years as
DNA sequencing technology improved. Automated DNA
sequencing, improved chemistries and the computer
hardware and software needed to manage the vast amounts of
sequence data generated from genome sequencing projects
have made the sequencing of large fragments of DNA,
including entire genomes, routine. Sequencing centers are
depositing hundreds of thousands to millions of bases of
DNA sequence each week into public databases. Private
companies are also generating vast amounts of DNA
sequence data as a means of identifying potential drug
targets for the pharmaceutical industry.

Microbial sequencing projects

A landmark in genome sequencing was reached in 1995
with the publication of the first completed genome from a
free-living organism, that of Haemophilus influenzae. The
1.8 Mb genome was the largest such completed genome
project and the first to employ a 'whole genome shotgun
approach' to an entire genome. Within one year, the
genomes of Mycoplasma genitalium and Methanococcus
jannaschii were also published. Recently, the genomes of
Borrelia burgdorferi [4], the causative agent of Lyme disease;
Helicobacter pylori [5], implicated in gastric ulcers; and
Mycobacterium tuberculosis [6], responsible for millions of
deaths annually, have been completed. Parasite genome
projects have also been started, many of which are funded
through the WHO/Tropical Disease Research program [7,8].
Genome data from microbial pathogens are being generated
at an increasing pace, and the implication is that this will
result in a better understanding of microbial biology and
will lead to more effective antimicrobial therapies. Human
pathogens are, however, not the only genomes that are
being sequenced. Sequencing is being carried out on
genomes from newly discovered families of organisms, such
as those that live in extreme conditions or those that have
unique metabolic requirements. For example, the genome
of Archaeoglobus fulgidus, an archaeon that lives at extremely
high temperatures and which metabolizes sulfur, has
recently been completed [9]. The results from these
projects and others have shown remarkable differences in
how each organism lives. In nearly all completed
microbial genome projects, between 25 and 35% of each
genome codes for proteins that have entirely unknown
functions. The success of these projects demonstrates that
entire microbial genomes can be elucidated relatively
quickly and economically, and has opened the door to
additional microbial sequencing projects. There are now
over 13 published microbial genomes and more than 60
others in the process of being sequenced. For a current list
of microbial genome projects by non-private sources see:
http://www.tigr.org/tdb/mdb/mdb.html [10].

Two general approaches have been taken in sequencing
these first microbial genomes. The first approach, known as
'whole genome shotgun sequencing', involves producing a
plasmid library by fragmenting genomic DNA into random,
small (1-2 kb pairs) fragments and cloning into a
sequencing plasmid. Clones from this library are chosen at random
and sufficient numbers are sequenced until an approximately
eightfold coverage is achieved. These fragment
sequences are then assembled into 'contigs' using
specialized computer software, forming as near a completed
genome as possible. The laborious process of closing gaps
between contigs generally requires a combination of
methods including additional sequencing, the polymerase
chain reaction and others. An advantage of the whole
genome shotgun method is that little previous knowledge
of the genome is needed. As the computational
requirements for assembling the genome are great, this strategy is
generally reserved for genomes of the order of a few
megabases. For larger genomes, a second approach is
traditionally taken. Sequencing centers focus initially on
the construction of ordered large insert clone libraries and
the construction of a 'physical map' of the genome. A
physical map is created by constructing clones (cosmids or
lambda vectors) containing large fragments of genomic
DNA (generally up to 40 kb pairs) and then determining
the minimal overlapping subset of these clones. Each is
then shotgun cloned into sequencing plasmids and
randomly sequenced. The genome sequence is generated
from both the random shotgun and physical map data.
Although a large initial effort is required for the production
of the physical map, larger genomes are more easily tackled
by this method because the computational requirements of
random sequence assembly need be applied only to each
large insert clone. As computational hardware and software
are improved the random fragment genome sequencing
strategy will be applied to larger genomes. A recent effort
has been announced that will combine a novel large insert
library approach with random shotgun sequencing in order
to sequence the entire human genome (3 billion base pairs)
in 3 years [11].
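
The relationship between genome size, read length and coverage described above can be sketched with a short calculation. This is only an illustration under stated assumptions: the 500-base read length is a hypothetical figure that does not come from the text, the function names are invented for this example, and the estimate of unsequenced bases uses the standard Poisson approximation for randomly placed reads rather than any method reported by the sequencing centers.

```python
import math


def reads_for_coverage(genome_bp: int, read_bp: int, coverage: float) -> int:
    """Number of random reads whose combined length equals coverage x genome size."""
    return math.ceil(coverage * genome_bp / read_bp)


def expected_unsequenced_fraction(coverage: float) -> float:
    """Poisson approximation: chance that a given base is covered by no read."""
    return math.exp(-coverage)


# Illustrative values: a 1.8 Mb genome sequenced to eightfold coverage.
GENOME_BP = 1_800_000
READ_BP = 500  # assumed average read length (hypothetical, not from the text)
COVERAGE = 8.0

n_reads = reads_for_coverage(GENOME_BP, READ_BP, COVERAGE)
missing_bp = expected_unsequenced_fraction(COVERAGE) * GENOME_BP

print(f"reads needed: {n_reads}")
print(f"expected unsequenced bases: {missing_bp:.0f}")
```

Under these assumptions a small number of bases is still expected to remain uncovered even at eightfold redundancy, which is consistent with the need for a separate gap-closure phase.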

Sequencing the malaria genome

An international consortium of funding agencies and
genome centers was formed in 1996, whose goal was
the completion of the entire genomic sequence of
P. falciparum [12*]. Several important technical hurdles were
addressed, including the observed instability of large
fragments of P. falciparum DNA in typical bacterial
plasmid systems, the large genome size, and the means
to partition the sequencing efforts among the members of
the consortium [13]. The greatest perceived challenge was
the fact that there had been little success in the
production of a bacterial plasmid or lambda vector library
containing inserts more than several kilobases in size, and
that plasmids containing even small inserts were often
rearranged. The extreme nucleotide composition of
P. falciparum and other Plasmodium species, with the
percentage of A and T in coding regions reaching 76% and in
non-coding regions approaching 100%, was thought to be
responsible for this instability in bacteria, in which the AT
composition is of the order of 50%. This instability has
not, however, been problematical in yeast systems; in fact
stable large insert libraries have been constructed using
yeast artificial chromosomes (YAC), and these libraries
have been used extensively in a Wellcome Trust-funded
P. falciparum genome mapping project. Unfortunately,
YAC clones are generally considered to be poor substrates
for sequencing, and random shotgun libraries derived from
purified YAC clones are notorious for yeast chromosomal
DNA contamination. Pilot projects were initiated to
address this instability; however, in the short term the
use of a large insert library approach to the sequencing
project was considered impractical, although the limited
sequencing of YAC clones was considered to be
potentially useful.
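
The A and T percentages quoted above are simple base-composition fractions. As a minimal sketch (the function name and the toy sequence are hypothetical and are not P. falciparum data), the fraction can be computed as follows; ambiguous characters such as N are ignored so that they do not distort the result.

```python
def at_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) in seq that are A or T."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    at_count = sum(1 for b in bases if b in "AT")
    return at_count / len(bases)


# Toy sequence for illustration only: 8 of its 10 bases are A or T.
example = "ATATTTAAGC"
print(f"A+T content: {at_fraction(example):.0%}")
```

Applying the same function separately to annotated coding and non-coding regions would give figures comparable to the 76% and near-100% values cited in the text.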

The genome of P. falciparum is large, approximately
30 Mb in size, and is distributed on 14 chromosomes,
which range in size from 650 kb to 3.4 Mb, and two
extrachromosomal elements, a 35 kb plastid-like element
and a 6 kb mitochondrial repeat element. As approximately
80% of the genome can be resolved by pulsed
field gel electrophoresis, the consortium agreed that the
genome should be tackled chromosome by chromosome
using the random shotgun sequencing approach. This
strategy facilitated sequencing of the genome by
distributing the entire genome effort among several
sequencing centers, with each chromosome project
equivalent to a 1-3 Mb microbial sequencing project.
Sequencing of the P. falciparum genome is being carried
out by The Institute for Genomic Research and Naval
Medical Research Institute (USA), The Sanger Centre
(UK) and Stanford University (USA) (see Table 1). A
pilot project by The Institute for Genomic Research
and the Naval Medical Research Institute was funded
by the National Institute of Allergy and Infectious
Diseases and the US Department of Defense to
determine the sequence of chromosome 2 from
P. falciparum (1 Mb) and has been completed (Gardner MJ
et al., submitted). At the time of writing, the sequence
of chromosome 3 (1.2 Mb) is nearing completion by the
Sanger Centre UK with funding by the Wellcome
Trust, and the chromosome 12 project is well underway
at Stanford University with funding by the Burroughs
Wellcome Fund. Projects to complete the remaining 11
chromosomes have been started by these three sequencing
centers, and the sequence data being generated are
released into the public domain on a regular basis. The
completion of these first chromosomes has validated the
approach adopted by the consortium and has given
credence to the hope that the entire genome will soon
be completed.

Conclusion 

A critical goal for the 'next genome project' will be the
use of genomic sequence data derived from the Malaria
Genome Project towards the development of novel
antimalarial drugs and vaccines, as well as establishing
a more complete understanding of parasite biology, host
interactions, and immune evasion and drug resistance
mechanisms. The challenge is to bring the vast amount
of genome sequence data from the genome project to
bear on malaria research. Technologies such as DNA
microarrays are already being developed to explore gene
regulation and expression on a genome-wide scale in
humans and microbes [14*,15,16]. Using these and other
technologies it will be possible to study the stage-specific
expression of the entire P. falciparum genome. By
studying gene expression from various differing
phenotypic isolates, for example those that differ in their
resistance to a particular drug, a more precise study of
the mechanisms of drug resistance will be possible. For
vaccine development, gene expression studies will be
useful for identifying potential targets expressed in
blood-stage parasites, and for drug development they
may be particularly useful for identifying the expression
of novel biochemical targets. As a result of current
technical limitations, it may be difficult to apply these
technologies to the study of gene expression in some life
cycle stages thought to be critical for vaccine
development, i.e. the infected hepatocyte. Furthermore, gene
expression studies alone may provide little information
regarding the subcellular localization of the translated
protein and may not be well correlated with protein
expression, especially secreted proteins [17]. Other
strategies will be needed to identify, characterize and
validate the thousands of vaccine targets that will be
identified from the genome project. A 'big science'
approach comparable to the magnitude of the genome
project will be required. For vaccine design, an approach
has been proposed to produce DNA vaccines against
each individual open reading frame, and by immunizing
groups of mice to produce antibodies against the
encoded protein [2]. These antibodies can then be used
to screen protein expression at each stage of the parasite
by immunofluorescence testing and then characterize
protein expression and location. A subset of vaccine
candidates are then selected on the basis of the pattern
of protein expression at the desired stage of the life
cycle. Ultimately, the selection of optimal vaccine and
drug candidates and the development of an optimal
antimalarial vaccine and drug will require a combination
of computer modelling, bioinformatics, gene expression
studies and protein expression analyses, but the
completion of the entire genome of P. falciparum will provide
the foundation for all these studies and give hope that
control of this devastating disease is within reach.
ABSTRACT


Detailed  restriction  maps  of  microbial  genomes  are  a  valuable  resource  in  genomic 
sequencing  studies  but  are  toilsome  to  construct  by  combining  maps  derived  from  cloned 
DNA.  Analysis  of  genomic  DNA  enables  large  stretches  of  the  genome  to  be  mapped  and 
circumvents  library  construction  and  associated  cloning  artifacts.  We  used  pulsed  field 
gel-purified  P.  falciparum  chromosome  2  DNA  as  the  starting  material  for  Optical 
Mapping,  an  approach  for  making  ordered  restriction  maps  from  ensembles  of  single 
DNA  molecules.  DNA  molecules  were  bound  to  derivatized  glass  surfaces,  cleaved  with 
Nhe  I  or  BamH  I  and  imaged  by  fluorescence  microscopy.  Large  pieces  of  the 
chromosome  containing  many  DNA  fragments  were  mapped.  Maps  were  assembled  from 
50-60  molecules  generating  an  average  contig  depth  of  15  molecules  and  a  high 
resolution consensus restriction map was generated. The maps were used as a means to
verify  assemblies  from  the  plasmid  library  used  for  sequencing.  Maps  generated  in  silico 
from  the  sequence  data  also  corresponded  with  the  optical  maps.  Such  high  resolution 
restriction  maps  may  become  an  indispensable  resource  for  large-scale  genome 
sequencing  projects. 

INTRODUCTION 

Optical  mapping  is  a  system  for  the  construction  of  ordered  restriction  maps  from 
single  molecules  (Schwartz  et  al.  1993,  Samad  et  al.  1995).  Individual  DNA  molecules 
bound  to  derivatized  glass  surfaces,  which  have  been  cleaved  with  restriction  enzymes, 
are  imaged  by  fluorescence  microscopy.  Cut  sites  are  visualized  as  gaps  between  cleaved 
DNA  fragments  which  retain  their  original  order  (Cai  et  al.  1995,  Cai  et  al.  1998).  Optical 
Mapping  has  been  used  to  prepare  maps  of  a  number  of  large  insert  clone  types  such  as 
bacterial  artificial  chromosomes  (Cai  et  al.  1998)  and  most  recently  genomic  DNA  (Lin 
et  al.  1998).  Large  fragments  of  randomly  sheared  DNA  are  mapped  with  high  cutting 
efficiency  and  the  many  overlapping  restriction  site  landmarks  allow  contigs  to  be 
assembled.  A  shotgun  mapping  strategy  can  thus  be  employed.  Optical  mapping  of 
genomic  DNA  enables  large  stretches  of  a  genome  to  be  mapped,  which  simplifies  contig 


formation.  Library  construction  is  obviated  enabling  mapping  of  organisms  such  as 
Plasmodium  falciparum  (P. falciparum)  with  AT-rich  DNA,  which  is  difficult  to  clone. 
Also,  cloning  artifacts  are  precluded  enabling  more  accurate  maps  to  be  generated. 
Furthermore,  small  amounts  of  starting  material  are  required  facilitating  the  mapping  of 
parasites,  which  are  problematic  to  culture.  Such  restriction  maps  provide  a  picture  of  the 
architecture  of  large  spans  of  the  genome  and  have  value  in  shotgun  sequencing.  They 
provide  an  ideal  scaffold  for  sequence  assembly,  finishing  and  verification.  Gaps  between 
contigs  can  be  characterized  in  terms  of  location  and  breadth,  thereby  facilitating  closure 
techniques. 

Sequencing  of  chromosome  2  of  P.  falciparum  was  recently  completed  by 
Gardner  and  colleagues  (Gardner  et  al.  1998),  as  part  of  an  international  consortium 
sequencing  the  whole  P.  falciparum  genome  (Foster  1995,  Dame  et  al.  1996).  Existing 
physical maps of P. falciparum chromosomes (chromosome 3 [Thompson and Cowman,
1997] and chromosome 4 [Watanabe and Inselberg, 1994; Sinnis and Wellems, 1988]),
prepared by restriction digestion, gel fingerprinting and hybridization of probes, are low
resolution and not suitable for sequence verification. In order to investigate the feasibility
of optical mapping of a whole eukaryotic chromosome we constructed high resolution,
ordered restriction maps of P. falciparum chromosome 2 using genomic DNA and later
compared these maps with those generated in silico from the sequence data.
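
An in silico restriction map of the kind mentioned above amounts to locating every recognition site in a sequence and listing the resulting fragment sizes in order. The following is a minimal sketch, not the authors' software: it assumes a linear molecule, uses the standard recognition sequences for Nhe I (G^CTAGC) and BamH I (G^GATCC), and runs on a made-up toy sequence. All function and variable names are hypothetical.

```python
# Recognition site and cut offset (bases from the start of the site) per enzyme.
ENZYMES = {
    "NheI": ("GCTAGC", 1),   # G^CTAGC
    "BamHI": ("GGATCC", 1),  # G^GATCC
}


def cut_positions(seq: str, enzyme: str) -> list[int]:
    """Return the 0-based positions at which the enzyme cuts a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_sizes(seq: str, enzyme: str) -> list[int]:
    """Ordered fragment lengths (an ordered restriction map) for a linear digest."""
    bounds = [0] + cut_positions(seq, enzyme) + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


# Toy sequence with two BamH I sites and no Nhe I site (illustration only).
toy = "AAAA" + "GGATCC" + "TTTTT" + "GGATCC" + "AA"
print(fragment_sizes(toy, "BamHI"))
print(fragment_sizes(toy, "NheI"))
```

The ordered list of fragment sizes is the quantity that would be compared against an optical map, where fragment order is preserved on the glass surface.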


MATERIALS  AND  METHODS 


Parasite  preparation 

P.  falciparum  (clone  3D7)  was  cultivated  using  standard  techniques  (Trager  and 
Jensen,  1976).  In  order  to  minimize  possible  alterations  of  the  genome  that  can  occur  in 
continuous  culture  (Corcoran  et  al.  1986),  parasite  aliquots  were  kept  frozen  in  liquid  N2 
until  needed  and  then  cultivated  only  as  long  as  necessary.  Parasites  were  cultivated  to 
late  trophozoite/early  schizont  stages  and  enriched  on  a  Plasmagel  gradient.  The 
parasitized  red  blood  cells  were  washed  once  with  several  volumes  of  10  mM  Tris  (pH 
8),  0.85%  NaCl  and  the  parasites  were  freed  from  the  erythrocytes  by  incubation  in  ice 
cold 0.5% acetic acid in dH2O for 5 min, followed by several washes in cold buffer. The
parasites  were  resuspended  to  a  concentration  of  2  x  1 0*Vml  in  buffer  and  maintained  in  a 
50°C  waterbath.  An  equal  volume  of  1%  InCert  agarose  in  buffer,  prewarmed  to  50°C, 
was  mixed  with  the  prewarmed  parasites  and  the  mixture  was  added  to  a  1  cm  x  1  cm  x 
10 cm gel mold, plugged at one end with solidified InCert agarose, and allowed to
cool  to  4°C.  The  agarose-embedded  parasites  were  pushed  out  of  the  mold  and  incubated 
with  50  ml  proteinase  K  solution  (2  mg/ml  proteinase  K  in  1%  sarkosyl,  0.5  M  EDTA)  at 
50°C  for  48  hrs  with  one  change  of  proteinase  K  solution  and  were  stored  in  50  mM 
EDTA  at  4°C. 

Chromosome  2  isolation  by  pulsed  field  gel  electrophoresis  (PFGE) 

Uniform  parasite  slices  were  taken  with  a  glass  coverslip  using  two  offset 
microscope  slides  as  guides.  One  half  to  one  quarter  of  a  single  slice  was  sufficient  per 
lane.  Parasite  slices  were  arranged  end  to  end  on  the  flat  side  of  the  gel  comb.  The 
parasites  were  fixed  to  the  comb  by  a  small  bead  of  molten  (60°C)  agarose.  The  comb 
was  then  placed  into  the  gel  mold  and  molten  agarose  (1.2%  SeaPlaque[FMC,  Rockland, 
ME]  in  0.5x  TBE)  poured  around  the  parasite-containing  slices.  Once  cooled,  the  comb 
was removed and the space filled with molten agarose. A CHEF DRIII apparatus (Bio-Rad,
Hercules, CA) was used for all chromosome separations. Gels were run with a 180-250 sec
ramped pulse time at 3.7 V/cm and a 120° field angle for 90 hrs at 14°C, with
recirculating buffer at approximately 1 l/min, using Saccharomyces cerevisiae and/or
Hansenula  wingei  CHEF  size  markers  (Bio-Rad).  To  minimize  UV  damage  to  the  DNA, 
gel slices were removed from the ends of the gel, stained with ethidium bromide (5 µg/ml)
and  visualized  by  long  wave  (320  nm)  UV  light.  Notches  corresponding  to  the  individual 
chromosomes  were  made  in  the  agarose  gel  and  used  as  guides  to  cut  the  chromosome 
from  the  gel.  The  chromosome-containing  gel  slices  were  stored  in  50  mM  EDTA  at  4°C 
until  needed.  The  gel  was  stained  with  ethidium  bromide  to  verify  the  chromosome 
excision.  The  genome  of  P.  falciparum  is  approximately  30  Mb  in  size,  consisting  of  14 
chromosomes  ranging  in  size  from  0.6-3.5  Mb  (Foote  and  Kemp  1989).  PFGE  resolves 
most of the P. falciparum chromosomes, except 5-9, which are of similar size and
co-migrate. The gel band containing P. falciparum chromosome 2 was easily
resolved,  cut  from  the  gel,  melted  at  72°C  for  seven  minutes  and  incubated  with  agarase 
at  40°C  for  two  hours.  The  melted  agarose  band  was  diluted  in  TE  to  a  final  DNA 
concentration suitable for optical mapping (~20 pg/µl).

Mounting  and  digestion  of  DNA  on  optical  mapping  surface 

Optical  mapping  surfaces  were  prepared  as  previously  described  (Aston  et  al. 
1998).  Briefly,  glass  coverslips  (18x18  mm2;  FISHERfinest,  Pittsburgh,  PA)  were 
cleaned  by  boiling  in  concentrated  nitric,  then  hydrochloric  acid.  Surfaces  were 
derivatized  with  3-aminopropyldiethoxymethyl  silane  (APDEMS;  Aldrich  Chemical, 
Milwaukee, WI). One surface was placed onto a microscope slide. A 10 µl DNA sample
was added to the edge between the surface and the slide and spread into the space
between them. The surface was then peeled off the slide.
Digestion was performed by adding 100 µl of digestion solution (50 mM NaCl, 10 mM
Tris-HCl [pH 7.9], 10 mM MgCl2, 0.02% Triton X-100, 20 units of restriction
endonuclease; New England Biolabs, Beverly, MA) onto the surface and incubating at
37°C for 15 to 30 min. The buffer was aspirated and the surface washed with water
before staining the DNA with YOYO-1 homodimer (Molecular Probes, Eugene, OR)
prior to fluorescence microscopy. Co-mounted lambda bacteriophage DNA was used as a
sizing  standard  and  also  to  estimate  cutting  efficiencies. 


Image  acquisition,  processing  and  map  construction 

DNA  molecules  were  viewed  by  fluorescence  microscopy.  The  Optical  Mapping 
surface  was  scanned  by  the  operator  for  individual  digested  DNA  molecules  of  adequate 
length  and  quality  to  be  collected  for  image  processing  and  map  making.  Images  were 
collected  with  a  charge  coupled  device  (CCD)  camera  (Princeton  Instruments,  Trenton, 
NJ) using Optical Map Maker (OMM) software, as previously described (Jing et al.
1998). Images of DNA fragments were processed using a modified version of NIH Image
(Huff,  Ph.D.  thesis)  which  integrates  fluorescence  intensity  for  each  fragment.  These 
values  were  used  to  assemble  an  ordered  restriction  map  for  each  molecule.  Maps  were 
manually  constructed  using  Microsoft  Excel  spreadsheets.  Fluorescence  intensity  of 
lambda  bacteriophage  DNA  standards  was  used  to  measure  the  size  of  the  P.  falciparum 
restriction fragments on a per-image basis. Cutting efficiencies (also on a per-image basis)
were determined by scoring cut sites on sizing standard molecules contained in the
same field as the genomic DNA molecules. Standard molecules were cut once by Nhe I
and five times by BamH I. The map for the entire chromosome 2 was manually assembled
into contigs by aligning overlapping regions of congruent cut sites. Molecules with no
overlapping regions were considered to be from a contaminating P.
falciparum chromosome and were discarded. Consensus maps for chromosome 2 were
assembled by averaging the fragment sizes from the individual molecule maps
underlying the contigs.
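
The sizing, cutting-efficiency and consensus steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the software used in this work: the function names, the multiplicative `correction` parameter and the use of `None` for uncovered fragments are all hypothetical; only the 48.5 kb size of the lambda standard comes from the text.

```python
LAMBDA_KB = 48.5  # size of the co-mounted lambda standard (from the text)


def fragment_size_kb(frag_intensity, lambda_intensity, correction=1.0):
    """Size a fragment from integrated fluorescence relative to the standard.

    `correction` is a hypothetical multiplicative factor that could absorb
    dye-uptake differences between sample and standard DNA.
    """
    if lambda_intensity <= 0:
        raise ValueError("standard intensity must be positive")
    return LAMBDA_KB * (frag_intensity / lambda_intensity) * correction


def cutting_efficiency(observed_cuts, molecules, sites_per_molecule):
    """Fraction of expected cut sites that were actually cut on the standards."""
    return observed_cuts / (molecules * sites_per_molecule)


def consensus_sizes(aligned_maps):
    """Average each fragment over the molecules that cover it.

    `aligned_maps` is a list of equal-length lists (one per molecule, already
    aligned); None marks a fragment that a molecule does not cover.
    """
    consensus = []
    for column in zip(*aligned_maps):
        sizes = [s for s in column if s is not None]
        consensus.append(sum(sizes) / len(sizes) if sizes else None)
    return consensus
```

For example, ten BamH I-digested standards (five sites each) showing 42 cuts in total would give `cutting_efficiency(42, 10, 5)`, i.e. 84%.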

Southern  blotting  of  P.  falciparum  genomic  DNA 

P. falciparum genomic DNA (10 µg) was digested with Nhe I or BamH I, resolved
by  PFGE  (POE  apparatus,  box  length  20  cm,  1%  gel  in  0.5x  TBE,  pulse  time  1  sec;  2  sec, 
switch  time  12  sec,  150  volts,  for  24  h)  (Schwartz  and  Koval,  1989),  blotted  and 
hybridized  with  probes  derived  from  small  insert  clones  used  for  sequencing  (PF2CM93 
and  PF2NA66).  Probes  were  labeled  by  random  priming. 

RESULTS 

P.  falciparum  chromosome  2  DNA  sample 


A  chromosome  2  gel  slice  was  used  as  starting  material.  Despite  the  AT-rich 
nature  of  the  P.  falciparum  genome  (80-85%),  melting  of  low  gelling  temperature 
agarose  inserts  did  not  affect  the  integrity  of  the  DNA  and  the  chromosome  was 
competent  for  optical  mapping. 

Improved  DNA  mounting  technique 

Previously, we mounted DNA samples onto an optical mapping surface by
sandwiching the sample between the surface and a microscope slide and
then peeling the surface from the slide. DNA molecules were stretched and fixed onto the
surface. This method works very well with bacteriophages, cosmids and BACs (Cai et al.
1995, Cai et al. 1998); however, larger genomic DNA molecules tend to form crossed
molecules. We improved this approach by adding the sample to the edge formed by
placing a surface onto a slide. The liquid DNA sample spread into the space between the
surface and the slide by capillary action. Consequently, DNA breakage was minimized,
molecules tended to be stretched in the same direction, and crossed molecules were also
minimized (see Fig. 1).

Nhe I and BamH I maps for P. falciparum chromosome 2

The genomic DNA was mapped with either Nhe I (Fig. 1A) or BamH I (Fig. 1B).
Fragment  sizes  were  calculated  by  comparison  with  co-mounted  lambda  bacteriophage 
DNA  (48.5  kb).  P.  falciparum  DNA  has  an  AT  content  of  80-85%  and  lambda 
bacteriophage  DNA  has  an  AT  content  of  50%.  The  YOYO-1  fluorochrome  used  for 
DNA  staining  preferentially  intercalates  between  GC  pairs.  A  correction  factor  was 
therefore  applied  to  each  fragment  size  to  correct  for  massively  different  fluorochrome 
incorporation  (Netzel  et  al.  1995).  Lambda  bacteriophage  DNA  was  also  used  to 
determine  areas  on  the  surface  where  cutting  efficiency  was  highest.  Cutting  efficiencies 
were  in  excess  of  80%.  Maps  were  obtained  from  individual  molecules  of  about  350  kb. 
Consensus  maps  were  assembled  from  50-60  molecules  generating  an  average  contig 
depth of 15 molecules. Chromosome 2 was found to be 976 kb by optical mapping with
Nhe I and 946 kb by optical mapping with BamH I (average size 961 kb). There were 40
fragments in the Nhe I map, ranging from 1.5 kb to 115 kb, with an average fragment size
of 24 kb (Fig. 2A). There were 30 fragments in the BamH I map, ranging from 0.5 kb to
80 kb, with an average fragment size of 32 kb (Fig. 2B). Each fragment size in the
consensus map was averaged from 10-15 fragments. Although P. falciparum chromosome 2
migrates as a distinct band by PFGE, we found the gel slice to contain only 60%
chromosome 2-specific DNA. The remaining optical mapping data were rejected.

Integration  of  optical  maps  and  sequence  data 

The  chromosome  2  sequence  assembled  by  Gardner  and  colleagues  (Gardner  et 
al.  1998)  is  a  large  contig  covering  most  of  the  chromosome.  The  optical  restriction  maps 
were  compared  to  restriction  maps  predicted  from  the  sequence,  and  there  was  very  good 
correspondence  between  the  two,  indicating  that  there  were  no  major  rearrangements  or 
errors  in  the  assembled  sequence  (Table  1).  The  optical  map  included  all  fragments  above 
500  bp  predicted  from  sequence.  The  overall  agreement  between  these  maps  and  the 
sequence  was  therefore  excellent,  with  the  average  fragment  size  difference  below  600  bp 
(relative  error  4.3%)  for  the  Nhe  I  map.  The  average  fragment  size  difference  for  the 
BamH  I  map  was  1.2  kb  (relative  error  5.8%).  However,  there  were  several  notable 
differences. Large differences in size were noted for the fragments at each end of the
chromosome (Tables 1 and 2). This is because the sequence for these
subtelomeric regions is still under construction. PCR products spanning subtelomeric
gaps  are  currently  being  sequenced.  The  optical  map  sizes  were  larger  than  those 
predicted  from  sequence  for  certain  other  fragments  (Tables  1  and  2).  These  differences 
were  due  to  falsely  large  fluorescence  intensity  measurements  caused  by  crossed 
molecules.  Currently,  we  integrate  length  measurements  with  fluorescence  intensity 
measurements  to  improve  on  our  sizing  of  these  fragments.  Chromosome  2  maps  using 
these  new  measurements  show  no  exceptional  errors  (not  shown;  work  in  progress).  The 
map  was  used  to  facilitate  sequence  verification.  Optical  maps  can  also  be  used  at  the 
earlier  sequence  assembly  stage  to  form  a  scaffold  for  assembly  of  contigs  formed  from 
sequencing. 
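
The comparison between optical maps and maps predicted from sequence can be illustrated with a short sketch. It assumes a linear sequence, the standard recognition sites for Nhe I (G^CTAGC) and BamH I (G^GATCC), and a simple per-fragment definition of relative error; the function names are hypothetical, and the error definition is an assumption rather than the one used to produce the figures reported here.

```python
import re

# Recognition sequence and cut offset (both enzymes cut after the first base).
ENZYMES = {"NheI": ("GCTAGC", 1), "BamHI": ("GGATCC", 1)}


def in_silico_fragments(seq, enzyme):
    """Return the ordered fragment lengths (bp) of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    # A lookahead is used so that overlapping sites are not skipped.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def mean_relative_error(optical_kb, predicted_kb):
    """Mean of |optical - predicted| / predicted over matched fragments."""
    if len(optical_kb) != len(predicted_kb):
        raise ValueError("maps must contain the same number of fragments")
    errors = [abs(o - p) / p for o, p in zip(optical_kb, predicted_kb)]
    return sum(errors) / len(errors)
```

Fragments shorter than the optical detection limit would have to be merged or dropped before calling `mean_relative_error`, since the two lists are compared position by position.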


Map confirmation by Southern blotting


In  order  to  confirm  the  optical  maps,  pulsed  field  gels  of  total  P.  falciparum  DNA 
digested with Nhe I or BamH I were generated. Plasmid clones used as sequencing
templates were used as probes on Southern blots of the gels. Restriction fragment sizes on
the blots were closely comparable to the fragments seen on the optical maps and
those predicted from the preliminary sequence. Probe PF2CM93 hybridized to a 7.5 kb
band  generated  by  Nhe  I  digestion  and  PFGE.  The  fragment  size  predicted  from  sequence 
information  was  7.6  kb.  The  corresponding  fragment  size  from  the  optical  map  was  also 
7.6 kb (Table 1). The same probe hybridized to a 41 kb band generated by BamH I
digestion  and  PFGE.  The  fragment  size  predicted  from  sequence  information  was  41.3 
kb.  The  corresponding  fragment  size  from  the  optical  map  was  40.8  kb  (Table  2).  Probe 
PF2NA66  also  generated  data  with  fragment  sizes  that  were  very  similar  (Tables  1  and 
2).  By  using  the  same  probe  on  DNA  digested  with  the  two  different  enzymes,  the  optical 
maps  were  oriented  and  linked  with  one  another. 

DISCUSSION 

We have generated a high-resolution optical restriction map of P. falciparum
chromosome 2, which was used for sequence verification. The maps can also be used for final
sequence assembly (Cai et al. 1998). The fidelity of the optical maps was checked by
Southern blotting. Firstly, this enabled the optical maps to be cross-checked against
the  sequence.  This  approach  could  be  useful  for  assembling  data  acquired  using  different 
techniques  and  would  allow  the  placement  of  very  short  sequence  contigs  onto  a  map. 
Secondly,  the  two  maps  were  also  oriented  and  linked  relative  to  each  other.  Linking  of 
single-enzyme  maps  produces  a  much  higher  resolution  multi-enzyme  map  which  is  rich 
in  information.  Smaller  contigs  can  be  placed  on  a  multienzyme  map  (Cai  et  al.  1998). 
This  approach  could  also  be  used  to  assign  STS  markers  or  ESTs  to  restriction  fragments 
on  a  whole  genome  optical  map. 

Despite the fact that chromosome 2 is easily resolved by PFGE, we found the
chromosome 2 gel slice to contain only 60% chromosome 2-specific DNA.
Consequently, much of the optical mapping data was rejected. Had we mapped
other chromosomes using the same strategy, we could not have expected to obtain clean
data from chromosomes that are less well resolved by PFGE, such as chromosomes 5-9.
Current  optical  mapping  studies  on  P.  falciparum  use  whole  genomic  DNA  as  starting 
material.  The  chromosomes  are  resolved  at  the  level  of  data  rather  than  as  physical 
entities.  The  data  segregates  into  14  deep  contigs  corresponding  to  the  various 
chromosomes.  Chromosome  2  can  be  resolved  based  on  size  and  the  near  complete 
correspondence with the data shown in this paper (one 600 bp BamH I fragment is
missing  on  the  whole  genome  map).  The  success  of  this  project  has  prompted  the  malaria 
genome  consortium  to  recommend  funding  of  whole  genome  mapping  to  assist  in  closure 
of  chromosomes,  as  well  as  for  verification  of  the  final  assembly. 

In  summary,  we  describe  the  construction  of  an  ordered  restriction  map  of  P. 
falciparum  chromosome  2  using  optical  mapping  of  genomic  DNA.  A  combined 
approach  using  shotgun  sequencing  and  optical  mapping  will  enable  sequence  assembly 
and  finishing  of  large  and  complex  genomes. 

Table 1. Comparison of Nhe I optical map with restriction map predicted from sequence.


Table  2.  Comparison  of  BamH  I  optical  map  with  restriction  map  predicted  from 
sequence. 


FIGURE  LEGENDS 


Fig. 1. Typical P. falciparum chromosome 2 molecules and their corresponding optical
maps, (A) digested with Nhe I and (B) digested with BamH I. Maps derived from the two
BamH I-digested molecules in (B) can be aligned.

Fig. 2. High-resolution optical mapping of P. falciparum chromosome 2 using (A) Nhe I
and (B) BamH I. The underlying contig used to generate the consensus map is shown.
The  map  predicted  from  sequence  information  is  shown  for  comparison. 

Abstract


Computational  gene  finding  research  has  emphasized  the  development  of  gene  finders 
for  bacterial  and  human  DNA.  This  has  left  genome  projects  for  some  small  eukaryotes 
without a system that addresses their need. This paper reports on a new system,
GlimmerM, that was developed to find genes in the malaria parasite P. falciparum. Because
the gene density in P. falciparum is relatively high, the system design was based on a
successful bacterial gene finder, Glimmer. The system was augmented with specially
trained modules to find splice sites, and was trained on all available data from the
P. falciparum genome. Although a precise evaluation of its accuracy is impossible at this time,
laboratory tests (using RT-PCR) on a small selection of predicted genes confirmed 100%
of those predictions. With the rapid progress in sequencing the genome of P. falciparum,
the availability of this new gene finder will greatly facilitate the annotation process.

1 Introduction


The  gene  finding  research  community  has  focused  considerable  attention  on  human  gene 
finding  and  bacterial  gene  finding.  This  is  not  surprising  given  the  attention  paid  to 
both  areas.  The  Human  Genome  Project  has  produced  many  millions  of  nucleotides  of 
sequence,  and  the  importance  of  rapidly  identifying  the  genes  in  this  sequence  cannot 
be  overstated.  This  task  is  made  difficult  by  the  fact  that  only  one  to  three  percent 
of  human  genomic  sequence  is  estimated  to  code  for  proteins.  On  the  bacterial  side, 
eighteen  complete  bacterial  and  archaeal  genomes  have  already  been  published,  with 
dozens  more  expected  in  the  next  two  years.  Gene  finders  for  these  prokaryotes  have 
an  advantage  in  that  approximately  90%  of  the  DNA  of  these  genomes  is  coding;  thus 
the  task  reduces  in  many  cases  to  choosing  between  competing  reading  frames.  On  the 
other  hand,  the  demand  for  accuracy  is  correspondingly  much  higher  in  the  prokaryotic 
world. 

In  between  these  two  genomic  worlds  lies  a  vast  array  of  eukaryotic  organisms  whose 
genomes  range  in  size  from  that  of  a  large  prokaryote  (on  the  order  of  tens  of  millions 
of  nucleotides)  to  those  that  are  larger  than  human  (billions  of  nucleotides).  Their  gene 
density  tends  to  be  much  lower  than  that  of  bacteria,  but  many  organisms  have  a  much 
higher  gene  density  than  humans.  The  only  eukaryotic  genome  that  has  been  completely 
sequenced,  that  of  the  yeast  Saccharomyces  cerevisiae,  has  approximately  one  gene  every 
five  kilobases.  This  corresponds  to  a  gene  density  of  20%.  Recently,  chromosome  2  of  the 
malaria  parasite  Plasmodium  falciparum  was  completed  (Gardner  et  al.  1998),  and  this 
organism  too  has  a  gene  density  of  20%.  The  remaining  13  chromosomes  from  malaria 
should  be  completed  over  the  course  of  the  next  few  years.  The  much  larger  (120  million 
nucleotides)  genome  of  arabidopsis,  which  also  is  expected  to  have  a  gene  density  of 
approximately  20%,  should  be  completed  in  the  same  time  frame,  and  many  projects 
are  under  way  to  sequence  other  small  eukaryotes. 

Because these organisms have a relatively high gene density compared with human DNA, a
gene finder developed for human sequence (or for other organisms with low gene density,
including most vertebrates and larger plant genomes) may not be the optimal approach
for P. falciparum and other small eukaryotes. Prokaryotic gene finders are not well
suited to this task because of their inability to handle introns. It is possible to re-train
human  gene  finders  using  different  data  (for  example,  GENSCAN  (Burge  and  Karlin 
1997)  has  been  trained  with  arabidopsis  data),  but  one  still  runs  the  risk  that  because 
these  systems  have  been  optimized  to  find  genes  in  DNA  that  is  only  3%  coding,  they 
may  miss  many  genes  in  genomes  such  as  P.  falciparum. 

This  paper  describes  a  gene  finder  developed  specifically  for  small  eukaryotes  with  a 
gene  density  of  around  20%.  This  system,  GlimmerM,  was  built  and  trained  using  data 
from  P.  falciparum,  the  malaria  parasite.  It  was  then  used  as  the  principal  gene  finder 
for  chromosome  2  of  P.  falciparum,  which  contains  210  genes  (209  protein  coding  genes 
plus  one  tRNA)  (Gardner  et  al.  1998).  Most  of  these  genes  were  found  by  GlimmerM, 
and  as  described  below,  some  were  confirmed  by  additional  laboratory  experiments. 

2  Methods  and  algorithms 

The basis of GlimmerM is a dynamic programming algorithm that considers all
combinations of possible exons for inclusion in a gene model, and chooses the best of these
combinations. The decision about what model is best is a combination of the strength
of the splice sites and the score of the exons produced by an interpolated Markov model
(IMM). The methods for producing these scores are described next, followed by the
description of the dynamic programming algorithm and the splice site recognition
procedures.




2.1  Interpolated  Markov  models 

Markov  chains  are  a  family  of  methods  for  computing  the  probability  of  an  event  based 
on  a  fixed  number  of  previous  events.  In  the  context  of  DNA  sequence  analysis,  Markov 
chains  predict  a  base  by  examining  a  fixed  number  of  bases  just  prior  to  that  base  in  the 
sequence.  The  most  common  type  of  Markov  chain  is  a  fixed-order  chain,  in  which  the 
number  of  previous  bases  to  examine  is  specified  in  advance.  For  example,  a  5th-order 
Markov  chain  will  predict  a  base  by  looking  at  the  five  previous  bases.  Markov  chains, 
and  5th-order  chains  in  particular,  have  proven  to  be  effective  at  gene  prediction  in 
bacterial genomes (Borodovsky and McIninch 1993; Borodovsky et al. 1995).
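
As a minimal sketch of a fixed-order Markov chain over DNA, the code below counts how often each base follows each context of `order` bases and turns the counts into smoothed probabilities. The function names and the add-one pseudocount are illustrative assumptions, not details of any published gene finder.

```python
import math
from collections import defaultdict


def train_markov(seqs, order):
    """Count, for every context of `order` bases, which base follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def prob(counts, context, base, pseudo=1.0):
    """P(base | context) with a pseudocount added for each of the four bases."""
    ctx = counts.get(context, {})
    total = sum(ctx.values()) + 4 * pseudo
    return (ctx.get(base, 0) + pseudo) / total


def log_score(counts, seq, order):
    """Log-probability of `seq` under the chain, skipping the first `order` bases."""
    return sum(
        math.log(prob(counts, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )
```

With `order=5` this corresponds to the 5th-order chain discussed above; an unseen context simply falls back to a uniform 1/4 through the pseudocounts.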

Interpolated  Markov  models  (IMM)  are  an  improvement  on  fixed-order  Markov 
chains.  The  main  distinction  is  that  rather  than  deciding  in  advance  how  many  bases 
to  consider  for  each  prediction,  these  models  will  use  varying  numbers  of  bases  for  each 
prediction.  In  some  contexts  they  will  use  5  bases,  while  in  others  they  might  use  6  or 
more  bases,  and  in  yet  other  cases  they  may  use  4  or  fewer  bases.  This  allows  IMMs 
to  be  sensitive  to  how  common  a  particular  oligomer  is  in  a  given  genome.  In  a  given 
genome,  many  5-mers  might  occur  rarely  and  should  not  be  used  for  prediction;  here 
the  IMM  will  fall  back  on  a  shorter  Markov  chain.  On  the  other  hand,  certain  8-mers 
may  occur  very  frequently,  and  for  those  the  IMM  can  use  this  longer  context  and  make 
a better prediction. In addition, the IMM can combine the evidence from the 8th-order
Markov chain and the 5th-order chain in such cases. Thus it has all the information
available to a 5th-order chain plus additional information. It is also worth noting that
both IMMs and 5th-order Markov chains should outperform methods based on codon
usage statistics. (Cf. Saul and Battistutta (1988), a codon usage method specific to
P. falciparum. Note that at the time of that work, much less Plasmodium data was
available, and higher-order statistics might have been inaccurate as a result.)
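
The idea of falling back on shorter contexts when a longer one is rare can be sketched as below. The weighting used here (a weight that grows linearly with the context count up to a hypothetical `threshold`) is a deliberately simple stand-in chosen for illustration; it should not be read as the weighting scheme of Glimmer or GlimmerM, and all names are assumptions.

```python
from collections import Counter


def count_kmers(seqs, max_order):
    """Count every oligomer of length 1 to max_order + 1 in the training set."""
    counts = Counter()
    for s in seqs:
        for k in range(1, max_order + 2):
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    return counts


def imm_prob(counts, context, base, max_order, threshold=10):
    """Interpolated P(base | context), built up from order 0 to longer contexts."""
    # Order 0: base frequencies with add-one smoothing.
    total = sum(counts[b] for b in "ACGT")
    p = (counts[base] + 1) / (total + 4)
    for k in range(1, min(max_order, len(context)) + 1):
        ctx = context[-k:]
        n_ctx = sum(counts[ctx + b] for b in "ACGT")
        if n_ctx == 0:
            break  # context never seen: keep the shorter-order estimate
        lam = min(1.0, n_ctx / threshold)  # trust frequent contexts more
        p = lam * counts[ctx + base] / n_ctx + (1 - lam) * p
    return p
```

Because each step mixes two probability distributions, the four values of `imm_prob` for a given context still sum to one.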

IMMs form the basis of the Glimmer system for finding genes in bacteria and archaea
(Salzberg et al. 1998). Glimmer correctly identifies approximately 98% of the genes
in bacteria without any human intervention, and with a very limited number of false
positives. It has been used as the gene finder for B. burgdorferi (Fraser et al. 1997), T.
pallidum (Fraser et al. 1998), C. trachomatis (Stephens et al. 1998), T. maritima (et al.
1999), and others. Based on the success of Glimmer in bacterial sequence annotation,
we  thought  that  IMMs  should  make  a  good  foundation  for  eukaryotic  gene  finding.  This 
is  particularly  true  of  small  eukaryotes  like  P.  falciparum  in  which  the  gene  density  is 
intermediate  between  that  of  prokaryotes  and  higher  eukaryotes. 

Details  of  how  to  construct  an  IMM  for  sequence  data  can  be  found  in  the  original 
Glimmer  publication  (Salzberg  et  al.  1998);  GlimmerM  uses  the  same  IMM  algorithm 
as  the  one  described  there.  In  brief,  GlimmerM  builds  IMMs  from  a  set  of  DNA 
sequences  chosen  for  training.  For  coding  regions,  it  builds  3  separate  IMMs,  one  for 
each codon position. (This is known as a 3-periodic Markov model (Borodovsky and
McIninch 1993).) These IMMs include 0th- through 8th-order Markov chains, as well as
weights computed for every oligomer of 8 bases or fewer that appears in the training data.
These  weights  and  Markov  models  are  interpolated  to  produce  a  score  for  each  base  in 
any  potential  coding  sequence.  The  logs  of  these  scores  are  summed  to  score  each  coding 
region. 
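
A sketch of the 3-periodic scoring step follows: one model per codon position, with the log of each per-base score summed over the region. The `models` argument is a hypothetical list of three callables standing in for trained IMMs, and the 8-base context window mirrors the maximum order mentioned above.

```python
import math


def score_coding_region(seq, models):
    """Sum of log-probabilities, using a separate model for each codon position.

    `models` holds three callables, model(context, base) -> probability,
    one per codon position; `seq` is assumed to start in frame.
    """
    total = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - 8):i]  # up to 8 preceding bases
        total += math.log(models[i % 3](context, base))
    return total
```

Summing logs rather than multiplying probabilities avoids numerical underflow on long coding regions.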

2.2  Dynamic  programming 

GlimmerM  uses  dynamic  programming  (DP)  to  find  genes  in  the  DNA  sequence  of 
a  malaria  chromosome.  DP  allows  it  to  prune  out  a  large  number  of  possible  exon- 
intron  combinations  and  focus  its  analysis  only  on  relatively  high-scoring  combinations 
(called  “parses”).  The  input  to  the  algorithm  is  any  genomic  DNA  sequence  in  a  FASTA 
format;  small  sequences  as  well  as  entire  chromosomes  can  be  input.  The  output  is  a 
partitioning of the DNA into coding regions interleaved with noncoding regions, on
both the main and complementary strands of the sequence.

Wrapped  inside  GlimmerM  is  the  interpolated  Markov  model  described  above, 
which  is  used  to  score  candidate  exons.  The  predicted  genes  are  optimal  with  respect  to 
the  scores  produced  by  the  IMM. 

Dynamic  programming  has  been  the  basis  of  many  successful  eukaryotic  gene  finders. 
Hidden  Markov  model  (HMM)  systems  use  a  DP  algorithm  called  Viterbi  that  is  a  special 
case  of  the  algorithm  here;  these  HMM  methods  include  VEIL  (Henderson  et  al.  1997); 
GENSCAN (Burge and Karlin 1997), which uses semi-Markov HMMs; and Genie (Kulp
et al. 1996), which uses generalized HMMs. Very recently, Wirth described a gene finder
for  P.  falciparum  based  on  generalized  HMMs  (Wirth  1998),  but  it  is  not  yet  available 
for  comparison.  The  Morgan  system  (Salzberg  et  al.  1996;  Salzberg  et  al.  1998)  uses  a 
DP  algorithm  as  a  wrapper  around  its  decision  tree  program,  and  GeneParser  (Snyder 
and  Stormo  1995)  uses  DP  wrapped  around  a  neural  network  program.  These  latter  two 
DP  formulations  are  most  similar  to  the  one  used  for  GlimmerM. 

As  in  many  other  gene  finders  (Salzberg  1998),  there  are  a  number  of  assumptions 
used  by  GlimmerM  when  predicting  genes  in  the  DNA  sequence.  These  assumptions 
are  derived  from  biological  constraints  on  mRNA  splicing,  transcription,  and  translation. 
Consequently,  we  assume  that: 

•  each  coding  region  of  a  gene  begins  with  a  start  codon  ATG, 

•  a gene has exactly one in-frame stop codon, which appears as the last codon in
the gene,

•  each exon must be in the same reading frame as the previous exon, and

•  every  intron  begins  with  the  dinucleotide  GT  and  ends  with  the  dinucleotide  AG. 

These constraints significantly enhance the efficiency of computing the optimal parse of
the DNA sequence, by restricting the search space of the DP algorithm. On the other
hand, genuine frameshifts cannot be detected by the system.

Figure 1: Dynamic programming to build gene models.
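
The four assumptions listed above can be expressed as a validity check on a candidate gene model. The sketch below handles the forward strand only and uses hypothetical names; exons are given as half-open `(start, end)` intervals. Reading the concatenated exons codon by codon is what enforces the reading-frame constraint.

```python
STOPS = {"TAA", "TAG", "TGA"}


def satisfies_constraints(seq, exons):
    """Check a forward-strand gene model against the four assumptions.

    `exons` is a position-sorted list of (start, end) half-open intervals.
    """
    coding = "".join(seq[s:e] for s, e in exons)
    # The spliced coding sequence must start with ATG and be a whole number of codons.
    if len(coding) % 3 != 0 or not coding.startswith("ATG"):
        return False
    codons = [coding[i:i + 3] for i in range(0, len(coding), 3)]
    # Exactly one in-frame stop codon, and it must be the last codon.
    if codons[-1] not in STOPS or any(c in STOPS for c in codons[:-1]):
        return False
    # Every intron begins with GT and ends with AG.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = seq[end1:start2]
        if len(intron) < 4 or not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True
```

A search procedure can call such a check to discard candidate models early, which is the sense in which the constraints shrink the search space.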

The  dynamic  programming  algorithm  starts  building  a  putative  gene  after  finding  a 
stop  codon  in  one  of  the  six  possible  reading  frames  (considering  both  the  forward  and 
reverse  directions).  It  attempts  to  complete  the  gene  by  searching  backwards  from  the 
stop  codon  to  the  various  possible  start  codons,  adding  exons  to  the  model  along  the 
way. 
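
The first step, locating stop codons in all six reading frames, might look like the following sketch. The names are hypothetical, and offsets on the reverse strand are reported relative to the reverse-complemented sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_codons_by_frame(seq):
    """Map (strand, frame) to the start offsets of stop codons in that frame."""
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            result[(strand, frame)] = [
                i for i in range(frame, len(s) - 2, 3)
                if s[i:i + 3] in STOP_CODONS
            ]
    return result
```

Each stop codon found this way can serve as the anchor S from which the backward search for start codons and exons proceeds.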

After  identifying  all  potential  splice  sites  (described  below),  the  DP  algorithm  then 
finds  potential  exons.  At  each  potential  start  codon  or  acceptor  site  n,  the  algorithm 
decides  if  there  is  an  exon  starting  at  n.  The  algorithm  prefers  to  make  the  exon  as 
long  as  possible;  this  is  generally  a  very  useful  heuristic  for  P.  falciparum,  in  which  the 
82%  AT-content  makes  long  stretches  of  noncoding  DNA  without  a  stop  codon  very 
unlikely.  After  delimiting  the  exon,  the  algorithm  searches  downstream  for  all  optimal 
parses  ending  at  the  same  stop  codon  and  consistent  with  the  same  reading  frame  of  the 
exon  under  consideration. 

The algorithm is best illustrated by example, as shown in Figure 1. Let S denote the
stop codon, and P(n, S) the optimal parse of the sequence starting at location n and
ending at the stop codon S. Suppose we are at location n and we have identified exon
E. We want to find P(n, S). Because the algorithm proceeds backwards from the stop
codon, it has already computed P(n', S), an optimal parse starting at location n' and
ending at S, which is made of the exons E'1, ..., E'm. It has also identified P(n", S), an
optimal parse starting at location n" and ending at S, made up of the exons E"1, ..., E"p.
In the case of a gene on the direct strand, n < n' and n < n". Both P(n', S) and P(n", S)
are in the same reading frame with exon E. To find an optimal parse P(n, S), we use the
IMM obtained in the training phase to score the sequence obtained by concatenating E
with E'1, ..., E'm. Likewise we score the sequence obtained after concatenating E and
E"1, ..., E"p. GlimmerM will choose the region that scores best as an optimal parse
P(n, S). In the case of equal scores, the system prefers the longest gene model.
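
One step of this recurrence can be sketched as follows. The function extends each previously computed downstream parse with the new exon E, scores the concatenated coding sequence, and breaks ties in favor of the longer model. The names are hypothetical and `score` is a stand-in for the trained IMM; this is an illustration of the selection step, not the GlimmerM implementation.

```python
def best_parse(exon, downstream_parses, score):
    """Choose P(n, S) by extending already-computed parses with `exon`.

    exon: coding string of the exon E starting at location n.
    downstream_parses: parses P(n', S), P(n", S), ... as lists of exon strings,
        all in frame with `exon` and ending at the same stop codon S.
        Pass [[]] when `exon` itself ends at S.
    score: callable that scores a concatenated coding sequence.
    """
    if not downstream_parses:
        raise ValueError("at least one downstream parse is required")
    candidates = [[exon] + parse for parse in downstream_parses]

    def key(parse):
        coding = "".join(parse)
        return (score(coding), len(coding))  # equal scores: prefer the longer model

    return max(candidates, key=key)
```

Storing the winning parse for each location n is what lets later (more upstream) steps reuse it instead of re-enumerating every exon combination.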

A  final  rule  inserted  into  the  GlimmerM  calculations  was  the  use  of  AT-content  to 
identify  exons.  It  has  been  observed  that  P.  falciparum  exons  have  an  AT-content  that 
averages  around  70-75%,  compared  to  the  genome  average  of  82%.  Thus  when  deciding 
among  alternative  gene  models,  this  statistic  sometimes  provides  additional  help.  This 
rule  was  incorporated  into  a  post-processing  step. 
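
A minimal sketch of the AT-content statistic follows. How GlimmerM weighs this statistic
in its post-processing step is not specified here, so the helper `prefer_lower_at` is
only a hypothetical use of it and not the program's actual rule.

```python
from typing import List


def at_content(seq: str) -> float:
    """Fraction of bases in `seq` that are A or T (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "AT" for base in seq) / len(seq)


def prefer_lower_at(models: List[List[str]]) -> List[str]:
    """Hypothetical tie-breaker between alternative gene models.

    Each model is a list of exon sequences; the model whose concatenated
    exons have the lowest AT-content is returned.
    """
    return min(models, key=lambda exons: at_content("".join(exons)))
```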

After predicting gene structures in both direct and complementary strands, GlimmerM
makes one more pass over the sequence to reject overlapping genes. Two overlapping
genes are scored again using the IMM, and the one with the better score is
retained. Gene models have a minimum total length of 200 bp (computed as the sum
of all coding bases); this value can be easily adjusted by the user.
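
The overlap and length filters can be sketched as below. The `GeneModel` class and
`reject_overlaps` function are hypothetical names, the stored `score` stands in for the
IMM score, and two details are assumptions made for the example: overlap is judged on
the full genomic span of each model, and conflicts are resolved greedily from the
highest score down.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneModel:
    exons: List[Tuple[int, int]]  # (start, end) of each exon, end exclusive
    score: float                  # stand-in for the IMM score of the model

    def coding_length(self) -> int:
        """Sum of all coding bases."""
        return sum(end - start for start, end in self.exons)

    def span(self) -> Tuple[int, int]:
        """Genomic interval from the first exon start to the last exon end."""
        return (min(s for s, _ in self.exons), max(e for _, e in self.exons))


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    a_start, a_end = a.span()
    b_start, b_end = b.span()
    return a_start < b_end and b_start < a_end


def reject_overlaps(models: List[GeneModel], min_coding_length: int = 200) -> List[GeneModel]:
    """Drop short models, then keep the better-scoring model of each overlap."""
    long_enough = [m for m in models if m.coding_length() >= min_coding_length]
    kept: List[GeneModel] = []
    for model in sorted(long_enough, key=lambda m: m.score, reverse=True):
        if not any(overlaps(model, other) for other in kept):
            kept.append(model)
    return sorted(kept, key=lambda m: m.span()[0])
```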

In  order  to  avoid  mistakenly  “committing”  to  the  wrong  gene  model  too  soon  in  its 
calculations,  GlimmerM  was  designed  to  output  several  competing  models,  along  with 
their  scores,  for  a  given  subsequence.  This  allowed  the  human  annotators  to  choose 
among  models  using  other  information  (such  as  alignment  data)  when  such  information 
was available.

2.3 Splice site identification

The  approach  used  by  GlimmerM  to  determine  the  splice  sites  is  similar  to  the  one 
described  in  (Salzberg  et  al.  1998).  A  2nd-order  Markov  chain  model  is  used  to  score  a 
16-base  region  around  donor  sites  and  a  29-base  region  around  acceptor  sites.  Two  2nd- 
order  Markov  models  were  built  for  each  type  of  site.  First,  a  “true”  Markov  model  was 
created  from  existing  data  on  known  5’  and  3’  consensus  sites.  This  data  was  collected 
by  exhaustively  combing  the  literature  for  every  documented  exon-intron  boundary.  A 
“false”  Markov  model  was  built  from  a  large  number  of  randomly  chosen  false  splice 
sites;  i.e.,  sequences  that  contained  the  consensus  GT  or  AG  dinucleotide  but  that  were 
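not true splice sites. The score of a site $s_i, s_{i+1}, \ldots, s_j$ was computed by each
Markov model according to the formula:

$$\mathrm{score}(s) = \sum_{k=i}^{j} M_{s,k}$$

where

$$M_{s,k} = \ln \frac{f_{(s_{k-2},\, s_{k-1},\, s_k),\, k}}{f_{(s_{k-2},\, s_{k-1}),\, k-1}}$$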

and  fs,k  is  the  frequency  of  substring  s  ending  at  location  k.  Note  that  for  the  leftmost 
position  in  the  splice  site  region,  M  is  taken  to  be  the  probability  given  by  the  0th  order 
Markov  model,  and  for  the  second  position,  M  is  given  by  the  1st  order  model.  The 
score  for  a  given  splice  site  is  computed  by  taking  the  difference  of  the  scores  obtained 
from  the  “true”  site  Markov  model  and  the  “false”  site  model. 
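
The scoring scheme described above can be sketched as follows. This is an illustrative
reimplementation under stated assumptions, not the GlimmerM source: the class name
`SiteModel` is hypothetical, training windows are assumed to be aligned strings of equal
length, and a pseudocount is added so that unseen substrings do not produce a logarithm
of zero (the text does not say how such cases were handled).

```python
import math
from collections import Counter
from typing import List


class SiteModel:
    """Position-specific 2nd-order Markov chain over fixed-length site windows."""

    def __init__(self, windows: List[str], pseudocount: float = 1.0):
        self.length = len(windows[0])
        if any(len(w) != self.length for w in windows):
            raise ValueError("all training windows must have the same length")
        self.n = len(windows)
        self.pseudocount = pseudocount
        # counts[(k, s)]: number of training windows in which substring s ends at position k
        self.counts = Counter()
        for w in windows:
            for k in range(self.length):
                for order in range(min(k, 2) + 1):
                    self.counts[(k, w[k - order:k + 1])] += 1

    def _log_term(self, w: str, k: int) -> float:
        # 0th order at the leftmost position, 1st order at the second, 2nd order after that
        order = min(k, 2)
        numerator = self.counts[(k, w[k - order:k + 1])] + self.pseudocount
        if order == 0:
            context_count = self.n
        else:
            context_count = self.counts[(k - 1, w[k - order:k])]
        return math.log(numerator / (context_count + 4 * self.pseudocount))

    def score(self, w: str) -> float:
        return sum(self._log_term(w, k) for k in range(self.length))


def splice_score(window: str, true_model: SiteModel, false_model: SiteModel) -> float:
    """Difference between the "true" site score and the "false" site score."""
    return true_model.score(window) - false_model.score(window)
```

A positive value means the window resembles the true training sites more than the false
ones; the cutoff actually applied to this score is a separate choice.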

After  building  the  models,  we  then  scored  all  the  true  splice  sites  and  a  large  selection 
of  randomly  chosen  false  sites.  We  then  set  minimum  cutoff  scores  in  order  to  correctly 
identify  most  (or  all)  true  sites,  and  measured  how  many  false  positives  we  would  expect 
with  various  thresholds. 
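
A minimal sketch of this cutoff selection is given below. The function names are
hypothetical, and the convention that a site is accepted when its score is at or above
the cutoff is an assumption made for the example.

```python
from typing import List, Tuple


def error_rates(true_scores: List[float], false_scores: List[float],
                cutoff: float) -> Tuple[float, float]:
    """Return (false negative rate, false positive rate) at a given cutoff.

    A site is called a splice site when its score is >= cutoff.
    """
    false_neg = sum(s < cutoff for s in true_scores) / len(true_scores)
    false_pos = sum(s >= cutoff for s in false_scores) / len(false_scores)
    return false_neg, false_pos


def cutoff_for_misses(true_scores: List[float], max_misses: int) -> float:
    """Highest cutoff that rejects at most `max_misses` of the true sites."""
    return sorted(true_scores)[max_misses]
```

Calling `cutoff_for_misses(true_scores, 0)` gives the strictest cutoff that still
recognizes every true site; raising `max_misses` trades missed true sites for fewer
false positives.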

Figure 2: Tradeoff between false positive rates and false negative rates for the Markov
chain method that recognizes exon-intron splice sites.

Figure 2 shows the tradeoff in thresholds for the splice site recognition function in
P. falciparum. The figure reflects the state of the system after re-training in late 1998
(after the conclusion of the chromosome 2 sequencing project), at which time 143 introns
were  available  from  P.  falciparum.  The  figure  shows  the  tradeoff  between  sensitivity  and 
selectivity  for  the  Markov  chain  method  on  both  donor  and  acceptor  sites.  Acceptor 
sites  are  much  easier  to  recognize:  with  a  false  negative  rate  of  0%  (corresponding  to 
a  sensitivity  of  100%,  meaning  that  all  true  sites  will  be  recognized),  the  false  positive 
rate  (the  percentage  of  AG  dinucleotides  that  will  incorrectly  be  called  acceptor  sites)  is 
just  0.56%.  For  donor  sites,  a  0%  false  negative  rate  corresponds  to  a  rather  high  6.1% 
false  positive  rate.  Setting  the  system  so  that  it  misses  4  of  the  143  true  donor  sites 
(2.8%)  reduces  this  false  positive  rate  to  2.9%. 

Note  that  before  sequencing  of  chromosome  2,  only  about  90  introns  were  known 
and  the  Markov  chain  models  were  consequently  not  as  accurate  as  those  here.  The 
implication  is  that  GlimmerM  will  perform  even  better  for  subsequent  chromosomes  of 
malaria,  and  we  intend  to  continue  re-training  it  as  the  genome  sequencing  progresses. 
The  additional  introns  were  identified  with  high  confidence  by  aligning  the  genomic 
sequence  of  chromosome  2  with  EST  sequences  that  spanned  introns  and  with  protein 
sequences  from  other  organisms.  The  statistics  presented  in  the  results  section  represent 
the  performance  of  GlimmerM  after  re-training  to  improve  the  donor  and  acceptor  site 
Markov  models. 

2.4  Code  availability 

The  complete  source  code  for  GlimmerM  will  be  made  available  soon;  it  has  already 
been  shared  with  other  malaria  genome  sequencing  centers.  The  code  includes  routines 
for  re-training  the  system  on  data  from  other  organisms.  A  version  of  the  system  trained 
on  Arabidopsis  thaliana  genes  is  currently  under  development.  Total  processing  time  to 
find  all  genes  in  malaria  chromosome  2  (approximately  one  million  nucleotides)  is  about 
half an hour on a Pentium 450 personal computer running Linux.

2.5 Annotating a genome

In  its  current  form,  GlimmerM  produces  multiple  gene  models  for  some  genes.  A 
model  is  a  set  of  exons  that,  when  concatenated  together,  form  a  single  protein.  When 
no  database  matches  were  found  to  support  a  GlimmerM  prediction,  the  chromosome 
2  annotation  reflects  the  highest  scoring  model.  Although  many  of  these  are  likely  to  be 
correct,  it  is  undoubtedly  the  case  that  some  are  not.  Further  investigation  is  required 
to  confirm  these  predictions  (but  see  below  for  laboratory  evidence  confirming  a  small 
subset). 

The GlimmerM algorithm was used as one of a suite of tools. Accurate gene
identification depends on using every tool available, and the description here should not be
taken as implying that GlimmerM alone can find all genes in P. falciparum or any
other genome. However, it was a central component in a larger strategy. Other important
computational tools used by the malaria chromosome 2 team were: (1) searches
of a nonredundant protein sequence database using gapped BLAST and PSI-BLAST
(Altschul et al. 1990; Altschul et al. 1997); (2) gapped alignments of DNA to protein
and EST sequence databases using DDS and DPS (Huang et al. 1997); (3) prediction
of putative signal peptides using SignalP (Nielsen et al. 1997); (4) prediction of
transmembrane domains with PHDhtm (Rost et al. 1995); and (5) prediction of nonglobular
structures with SEG (Wootton and Federhen 1996). In addition, the project used
alignment tools developed at TIGR to detect frameshift errors: these tools allow
an annotator to detect when a sequence alignment extends beyond the start and stop
codons indicated by other tools. In some cases this indicates errors in sequencing, which
can be corrected; in other cases it indicates either a genuine frameshift that occurs
during translation or a mutation that has changed the length of the translated protein. Any
comprehensive annotation effort needs these computational tools and more in order to
produce reasonably accurate gene annotations.

3 Results and discussion


GlimmerM  was  used  as  the  primary  gene  finder  for  chromosome  2  of  P.  falciparum, 
for  which  it  finds  207/209  (99%)  genes  automatically,  as  detailed  below.  In  addition 
to  GlimmerM,  the  annotation  process  used  other  computational  tools  as  mentioned 
above.  Chromosome  2  has  209  protein-coding  genes  spread  over  approximately  one 
million bases, for a gene density of one gene per 4.5 kb. This contrasts with a density
of 1/kb in bacteria, 1/2 kb in yeast, 1/7 kb in C. elegans, and 1/50 kb (estimated) in
human.  Of  the  209  protein-coding  genes,  43%  had  at  least  one  intron  and  those  genes 
with  introns  usually  had  just  one  or  two  introns  (Gardner  et  al.  1998). 

3.1  Training 

In  order  to  train  the  IMM,  we  needed  to  collect  as  much  coding  sequence  as  possible  from 
P.  falciparum  itself.  We  exhaustively  surveyed  the  literature  to  collect  every  sequence, 
including  partial  genes,  that  was  backed  by  laboratory  evidence.  Our  survey  collected 
110  complete  coding  sequences  from  all  14  chromosomes,  of  which  just  6  came  from 
chromosome  2.  (Note  that  by  length,  chromosome  2  comprises  approximately  3%  of  the 
genome.)  This  training  set  provided  the  data  for  the  splice  site  models  described  above 
as  well. 

An  important  point  to  emphasize  here  is  that  P.  falciparum  has  an  unusually  high 
82%  AT  content.  As  a  consequence  of  this  high  AT  content,  stop  codons  are  very 
frequent  (e.g.,  TAA  will  occur  especially  often)  in  noncoding  DNA.  This  makes  it  much 
more  likely  that  long  open  reading  frames  (ORFs)  represent  coding  sequence.  This  fact 
was  used  to  generate  additional  training  data  for  GlimmerM:  ORFs  greater  than  500bp 
in  the  chromosome  2  sequence  were  assumed  to  be  coding  regions,  and  were  used  in  the 
IMM  training.  These  were  added  to  the  list  generated  by  the  literature  search. 
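
The extraction of long ORFs can be sketched as follows. The text does not define the
ORF boundaries precisely, so this example assumes an ORF is a stop-free run of codons
in one of the three forward reading frames; the function name `long_orfs` is
hypothetical, and the complementary strand would be handled by passing in the reverse
complement.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def long_orfs(seq: str, min_length: int = 500) -> List[Tuple[int, int]]:
    """Return (start, end) of stop-free codon runs longer than `min_length` bases.

    Coordinates are 0-based with `end` exclusive; the stop codon itself
    is not included in the reported interval.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - run_start > min_length:
                    found.append((run_start, pos))
                run_start = pos + 3  # next run begins after the stop codon
            pos += 3
        # Run that reaches the end of the sequence without a stop codon
        if pos - run_start > min_length:
            found.append((run_start, pos))
    return sorted(found)
```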

3.2 Accuracy

Of  the  209  genes,  GlimmerM  finds  178  exactly;  i.e.,  it  identifies  the  correct  start 
codon,  the  correct  boundaries  of  every  exon  and  intron,  and  the  correct  stop.  Of  these, 
40  have  competing  gene  models  that  score  higher,  meaning  that  a  human  annotator 
had  to  examine  the  output  and  decide  which  gene  model  looked  best.  This  process  was 
frequently  made  easier  by  the  existence  of  EST  or  protein  matches. 

Of  the  remaining  31  genes,  GlimmerM  finds  the  stop  codons  correctly  for  14  of 
these.  Different  starts  appear  in  the  final  annotation  for  several  reasons;  for  example, 
the  existence  of  a  match  to  a  protein  sequence  that  starts  at  a  different  start  codon. 
(Note  that  it  is  entirely  possible  that  GlimmerM  is  still  correct  in  these  cases.)  In  rare 
cases,  a  protein  hit  produces  a  coding  region  that  contains  a  stop  codon;  these  indicate 
genuine  frameshifts  and  are  annotated  as  such  in  the  database.  The  system  finds  the 
correct  start  but  the  wrong  stop  codon  for  four  genes;  this  occurs  in  multi-exon  genes 
in which a splice site is missed and one of the exons is incorrectly extended until it hits
a  stop  codon.  The  11  remaining  partial  hits  are  cases  in  which  GlimmerM  predicts 
some  but  not  all  exons  correctly;  for  example,  three  multi-exon  genes  are  each  broken 
into  two  separate  genes. 

Only two of the 209 genes are missed completely. One is a predicted integral membrane
protein of 192 aa that was predicted by the original version of GlimmerM (before
re-training the splice site models). A separate program was used to predict the function of
this  protein;  it  did  not  align  to  any  known  sequences.  The  second  is  ribosomal  protein 
S30;  ribosomal  proteins  often  have  a  strikingly  different  composition  from  other  genes 
and  are  known  to  be  difficult  for  content-based  gene  finders  to  locate.  These  will  not 
be  missed  as  long  as  genomic  data  is  searched  against  databases  of  known  ribosomal 
proteins. 
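
As a quick consistency check, the categories reported in this section can be tallied
directly. The counts are taken from the text above; the dictionary labels are
paraphrases introduced for this example.

```python
# Outcome counts for the 209 annotated genes, as reported in this section
outcomes = {
    "exact match": 178,
    "correct stop, different start": 14,
    "correct start, wrong stop": 4,
    "some but not all exons correct": 11,
    "missed completely": 2,
}

total = sum(outcomes.values())
found = total - outcomes["missed completely"]

print(f"{found}/{total} genes found ({found / total:.0%})")
print(f"{outcomes['exact match'] / total:.0%} predicted exactly")
```

The categories sum to the 209 annotated genes, of which 207 are found at least in part.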

The improved splice site Markov models resulted in GlimmerM generating 41
fewer gene models than before. In addition to the one missed gene just described, it
generated  five  new  gene  models.  Of  these,  one  appears  to  be  a  secreted  protein,  and  we 
are  currently  investigating  this  to  see  if  it  should  be  added  to  the  published  annotation. 

A  significant  caveat  to  include  with  these  results  is  that  GlimmerM  often  produces 
multiple  competing  models  that  the  human  annotator  must  resolve.  Most  genes  with 
three  or  more  exons  result  in  multiple  models.  The  system  indicates  which  model  scores 
the  highest,  but  as  indicated  above,  40  of  the  “correct”  gene  models  had  alternative 
parses  that  scored  higher.  These  alternative  parses  share  some  exons  but  use  different 
splice  sites  for  others.  A  human  annotator,  looking  at  additional  evidence,  was  able  to 
overrule  the  system’s  top  choice  in  these  cases.  It  is  likely  that  in  other  cases  where  no 
evidence  besides  GlimmerM’s  prediction  is  available,  some  of  the  published  annotation 
may  still  be  in  error  (all  such  proteins  are  annotated  as  hypotheticals).  After  collapsing 
each  set  of  multiple  gene  models  into  one  model,  the  gene  list  still  contains  266  genes. 
This means that (since only 209 genes appeared in the final annotation) the annotators
eliminated another 57 gene models entirely from the output. These decisions were
somewhat  subjective:  frequently  the  putative  genes  were  short,  or  they  consisted  mostly 
of  low-complexity  sequence,  and  this  was  not  enough  to  convince  the  human  annotators 
that  the  genes  were  real.  In  many  cases  the  annotators  are  probably  correct,  but  it  is 
simply  impossible  at  this  point  to  say  with  confidence  that  all  of  the  deleted  genes  are 
false  positives.  Only  further  evidence  will  allow  us  to  decide,  but  this  makes  clear  the 
importance  of  continuing  to  update  and  improve  genome  annotation  over  time. 

3.3  Performance  on  known  genes 

Another way to assess the accuracy of the program is to consider its accuracy on those
proteins whose exon-intron structure is known precisely from laboratory studies. There
are seven genes from chromosome 2 of P. falciparum that currently fit into this category;
i.e., the sequence from start to stop has been completely characterized. Of these seven,
one (PFB0100c) was published subsequent to the construction of our training set; the
other six were included in the training data.

Table 1: Performance of GlimmerM on genes whose structure is completely known from
independent laboratory evidence. All seven genes had perfect matches to the system's
predictions, meaning that the start codon, stop codon, and every splice site were correctly
predicted. The column headings give the gene name, its length in amino acids, number
of introns (Intr), a comment on GlimmerM's prediction, and the common name of the
protein.

Name      Len   Intr  Comment                     Common name
--------  ----  ----  --------------------------  --------------------------------
PFB0100c   654  1     Perfect match               knob-associated His-rich protein
PFB0295w   471  0     Perfect match               adenylosuccinate lyase (00)
PFB0300c   272  0     Perfect match               merozoite surface antigen MSP-2
PFB0305c   272  1     Perfect match               merozoite surface antigen MSP-5
                                                  (EGF domain)
PFB0310c   272  1     Perfect match, highest      merozoite surface antigen MSP-4
                      score from 5 models         (EGF domain)
PFB0340c   997  3     Perfect match, differed     SERA antigen/papain-like
                      at one splice site          protease with active Ser
                      (see main text)
PFB0405w  3135  0     Perfect match, higher       transmission blocking
                      score from 2 models         target antigen PfS230

GlimmerM’s  performance  on  this  small  set  of  genes  is  shown  in  Table  1.  For  all 
seven  of  the  genes,  GlimmerM’s  output  contained  a  model  that  matched  perfectly. 
For four of the genes, the correct model was the only one output by the system.
For PFB0310c and PFB0405w, GlimmerM produced five and two competing models
respectively,  but  in  each  case  the  highest  scoring  one  was  correct.  Only  for  PFB0340c, 
a  4-exon  gene,  was  GlimmerM’s  highest  scoring  model  not  the  right  one.  The  system 
did  produce  the  correct  answer,  but  it  gave  a  slightly  higher  score  to  a  model  that  used 
a  different  donor  site  for  the  first  exon.  GlimmerM’s  alternate  prediction  would  have 
a 23 aa insertion in this 997 aa protein.

3.4 Laboratory tests

The  only  way  of  measuring  the  accuracy  of  GlimmerM  precisely  is  to  test  each  of  its 
predictions  in  the  laboratory  to  see  if  they  are  expressed  as  predicted.  Alternatively,  for 
those  genes  with  significant  homology  to  genes  from  other  organisms,  the  homologous 
gene  can  be  used  as  an  independent  confirmation  of  the  prediction.  For  chromosome 
2,  90  predicted  proteins  (43%)  have  no  homolog  in  the  existing  databases  (Gardner 
et  al.  1998),  and  the  task  of  testing  all  of  these  predictions  has  not  yet  been  conducted. 
Over  time,  we  expect  some  of  them  to  show  up  as  homologs,  obviating  the  need  for 
laboratory  experiments.  However,  one  careful  set  of  experiments  was  conducted  as  part 
of  the  chromosome  2  study. 

Because many of the proteins predicted by GlimmerM had unusual nonglobular
domains, the chromosome 2 project team ran a reverse transcriptase PCR (RT-PCR)
experiment for 13 of these genes (Gardner et al. 1998) to determine whether or not they were
real. The RT-PCR focused its attention on nonglobular domains, not entire proteins,
so it could not confirm every detail of the GlimmerM predictions. This experiment
confirmed that all 13 of the nonglobular domains were expressed; i.e., the predictions
for those regions were correct. To our knowledge, this is the first time ever that
computational predictions provided the impetus for experiments which in turn confirmed the
predictions.